Abstract

Building on ideas from Castillo and Nickl [1], a method is provided to study nonparametric
Bayesian posterior convergence rates when 'strong' measures of distances, such as the
sup-norm, are considered. In particular, we show that likelihood methods can achieve
optimal minimax sup-norm rates in density estimation on the unit interval. The introduced
methodology is used to prove that commonly used families of prior distributions on densities,
namely log-density priors and dyadic random density histograms, can indeed achieve optimal
sup-norm rates of convergence. New results are also derived in the Gaussian white noise
model as a further illustration of the presented techniques.

In the fundamental contributions by Ghosal, Ghosh and van der Vaart [7], Shen and Wasserman [19]
and Ghosal and van der Vaart [9], a general theory is developed to study the behaviour of Bayesian
posterior distributions. A main tool is provided by the existence of exponentially powerful tests
between a point and the complement of a ball for some distance. A number of important distances,
such as the Hellinger distance between probability measures, indeed guarantee the existence of such
tests. The theory often also allows extensions to other metrics, for instance $L^2$-type distances,
but the question of dealing with arbitrary metrics has been left essentially open so far. Although
a general theory might be harder to obtain, it is natural to consider such a problem in simple,
canonical, statistical settings first, such as Gaussian white noise or density estimation. This is
the starting point of the authors in Giné and Nickl [10], and this paper was the first to provide
tools to get rates in strong norms, such as the $L^\infty$-norm. Exponential inequalities for frequentist
estimators are used in [10] as a way to build appropriate tests, and this enables one to obtain some
rates in sup-norm in density estimation. In the case where the true density is itself a convolution
and the minimax rate is then nearly parametric, the minimax sup-norm rate is attained, at least
up to a possible logarithmic term, see also the work by Scricciolo [18] for related results. In the
general case where the true density belongs to a Hölder class, a sup-norm rate is obtained which
differs from the minimax rate by a polynomial power of $n$. On the other hand, by using explicit
computations, the authors in [10] show that in the Gaussian white noise model with conjugate
Gaussian priors, minimax sup-norm rates are attainable, which raises the natural question of
whether this is still possible in density estimation, or in non-conjugate regression settings.
This non-trivial question also arises for other likelihood methods, such as nonparametric maximum
likelihood estimation, see Nickl [13].

Here we establish that minimax optimal sup-norm rates of convergence in density estimation
are attainable by common and natural Bayes estimators. The methodology we introduce is in fact
related to a programme initiated in [4], namely nonparametric Bernstein-von Mises type results,
as discussed below. The testing approach is replaced here by tools from semiparametric
Bernstein-von Mises results (testing is still typically useful to establish preliminary rates). We break the
distance of interest into simpler pieces, each simpler piece being a semiparametric functional to
study. One novelty of the paper consists in providing well-chosen, uniform approximation schemes
of various influence functions appearing at the semiparametric level when estimating those simple
functionals.

Two natural families of nonparametric priors are considered for density estimation: priors on
log-densities, see e.g. Ghosal, Ghosh and van der Vaart [7], Scricciolo [16], Tokdar and Ghosh
[20], van der Vaart and van Zanten [23], [1], Rivoirard and Rousseau [14], and random dyadic
histograms; for random histogram priors see e.g. Ghosal and van der Vaart [9], Scricciolo [17],
Giné and Nickl [10] and the recent semiparametric treatment in [5].

New results are also derived in the Gaussian white noise model, in the spirit of [4], for non- 
conjugate priors. 

While working on this paper, we learned of the work by Marc Hoffmann, Judith Rousseau
and Johannes Schmidt-Hieber [12], which independently proves sup-norm properties for different
priors. Their method is different from ours, and we believe that both approaches shed light on
different specific aspects of the problem.

Let $L^2[0,1]$ and $L^\infty[0,1]$ respectively denote the space of square integrable functions with
respect to Lebesgue measure on $[0,1]$ and the space of measurable bounded functions on $[0,1]$.
These spaces are equipped with their usual norms, respectively denoted $\|\cdot\|_2$ (denote by $\langle\cdot,\cdot\rangle_2$
the associated inner product) and $\|\cdot\|_\infty$. Let $C^\alpha := C^\alpha[0,1]$ denote the class of Hölder functions
on $[0,1]$ with Hölder exponent $\alpha > 0$.

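For any $\alpha > 0$ and any $n \ge 1$, denote by $\varepsilon_{n,\alpha}$ the rate

\[ \varepsilon_{n,\alpha} := n^{-\frac{\alpha}{2\alpha+1}}. \tag{1} \]

The typical minimax rate over a ball of the Hölder space $C^\alpha[0,1]$, $\alpha > 0$, for the sup-norm is

\[ \varepsilon^*_{n,\alpha} := \left(\frac{\log n}{n}\right)^{\frac{\alpha}{2\alpha+1}}. \tag{2} \]

Let us also set, omitting the dependence in $\alpha$ in the notation,

\[ h_n = \left(\frac{n}{\log n}\right)^{-\frac{1}{2\alpha+1}}, \qquad L_n = \lfloor \log_2(1/h_n) \rfloor. \tag{3} \]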
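To get a feeling for the orders of magnitude involved, the quantities in (1)-(3) can be evaluated numerically. The following is a minimal sketch, assuming the rates read $\varepsilon_{n,\alpha} = n^{-\alpha/(2\alpha+1)}$, $\varepsilon^*_{n,\alpha} = (\log n/n)^{\alpha/(2\alpha+1)}$, $h_n = (n/\log n)^{-1/(2\alpha+1)}$ and that $L_n$ is the integer part of $\log_2(1/h_n)$ (the rounding direction is an assumption). The function name `rates` is hypothetical.

```python
import math

def rates(n: int, alpha: float):
    """Return (eps, eps_star, h_n, L_n) for sample size n and smoothness alpha."""
    if n < 2 or alpha <= 0:
        raise ValueError("need n >= 2 and alpha > 0")
    eps = n ** (-alpha / (2 * alpha + 1))                      # rate (1)
    eps_star = (math.log(n) / n) ** (alpha / (2 * alpha + 1))  # sup-norm rate (2)
    h_n = (n / math.log(n)) ** (-1 / (2 * alpha + 1))          # bandwidth in (3)
    L_n = math.floor(math.log2(1 / h_n))                       # cut-off (floor assumed)
    return eps, eps_star, h_n, L_n

eps, eps_star, h_n, L_n = rates(10**6, 1.0)
```

With these conventions $h_n^\alpha = \varepsilon^*_{n,\alpha}$, and the sup-norm rate is slower than $\varepsilon_{n,\alpha}$ by a power of $\log n$ only.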

For a statistical model $\{P_f^{(n)}\}$ indexed by $f$ in some class of functions to be specified and
associated observations $X^{(n)}$, denote by $f_0$ the 'true' function and by $E^n_{f_0}$ the expectation under
$P_{f_0}^{(n)}$. Given a prior $\Pi$ on a set of possible $f$s, denote by $\Pi[\,\cdot \mid X^{(n)}]$ the posterior distribution and
by $E^\Pi[\,\cdot \mid X^{(n)}]$ the expectation operator under the law $\Pi[\,\cdot \mid X^{(n)}]$.

The symbol $\lesssim$ means less than or equal to, up to some universal constant.

Acknowledgements. This work started from ideas developed in the work [4] in common with 
Richard Nickl. His numerous helpful comments are warmly acknowledged. I am also indebted to 
Judith Rousseau for early explanations on semiparametric analysis of histogram priors. Although
the treatment for histograms here is a bit different, it owes much to her ideas, and we refer to [5] for a
systematic treatment of histograms for fixed functionals. I also would like to thank Catia Scricciolo
for early discussions and Gerard Kerkyacharian and Dominique Picard for helpful comments. 

2 Prologue 

Let us start with a simple example in Gaussian white noise which will serve as a -slightly naive yet
useful- illustration of the main technique of proof.

Let $f$ be an element of $L^\infty[0,1]$. Let $n \ge 1$. Suppose one observes a realisation

\[ dX^{(n)}(t) = f(t)\,dt + \frac{1}{\sqrt n}\,dW(t), \qquad t \in [0,1], \tag{4} \]

where $W$ is standard Brownian motion. Let $\{\psi_{lk},\ l \ge 0,\ 0 \le k \le 2^l - 1\}$ be a wavelet basis on
the interval $[0,1]$. Here we take the basis constructed in [6], see below for precise definitions. The
model (4) is statistically equivalent to observing the projected observations onto the basis $\{\psi_{lk}\}$,

\[ x_{lk} = f_{lk} + \frac{1}{\sqrt n}\,\varepsilon_{lk}, \qquad l \ge 0,\ 0 \le k \le 2^l - 1, \]

where $f_{lk} := \langle f, \psi_{lk}\rangle_2$ and $\varepsilon_{lk}$ are i.i.d. standard normal. Denote $\hat f_{lk} := x_{lk}$, an efficient frequentist
estimator of the wavelet coefficient $f_{lk}$.

2.1 A first example 

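Suppose the coefficients of the true function $f_0$ verify, for some $R > 0$ that we suppose to be known
in this first example, and $\alpha > 0$,

\[ \sup_{l \ge 0,\ 0 \le k \le 2^l - 1} 2^{l(\frac12 + \alpha)} |f_{0,lk}| \le R. \tag{5} \]

Define a prior $\Pi$ on $f$ via an independent product prior on its coordinates $f_{lk}$ onto the considered
basis. The component $f_{lk}$ is assumed to be sampled from a prior with density $\sigma_l^{-1}\varphi(\cdot/\sigma_l)$ with
respect to Lebesgue measure on [0, 1], where, for $\alpha, R$ as in (5), $x \in \mathbb{R}$ and a given $B > R$,

\[ \varphi(x) = \frac{1}{2B}\, 1_{[-B,B]}(x), \qquad \sigma_l = 2^{-l(\frac12+\alpha)}. \tag{6} \]

For $L_n$ defined by (3), let us denote by $f^{L_n}$ the orthogonal projection in $L^2[0,1]$ of $f$ onto
Vect$\{\psi_{lk},\ l \le L_n,\ 0 \le k < 2^l\}$ and by $f^{L_n^c}$ the projection of $f$ onto Vect$\{\psi_{lk},\ l > L_n,\ 0 \le k < 2^l\}$. Then

\[ f - f_0 = f^{L_n} - \hat f^{L_n} + \hat f^{L_n} - f_0^{L_n} + f^{L_n^c} - f_0^{L_n^c}, \]

where $\hat f^{L_n}$ is the projection estimator onto the basis $\{\psi_{lk}\}$ with cut-off $L_n$. Decomposing as above,

\[
\begin{aligned}
E^\Pi[\|f - f_0\|_\infty \mid X^{(n)}] &= \int \|f - f_0\|_\infty\, d\Pi(f \mid X^{(n)}) \\
&\le \int \|f^{L_n} - \hat f^{L_n}\|_\infty\, d\Pi(f \mid X^{(n)}) + \int \|f^{L_n^c}\|_\infty\, d\Pi(f \mid X^{(n)}) + \|\hat f^{L_n} - f_0^{L_n}\|_\infty + \|f_0^{L_n^c}\|_\infty \\
&=: (i) + (ii) + (iii) + (iv).
\end{aligned}
\]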

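The term (iv) is purely bias and, using (5) and the localisation property of the wavelet basis
$\|\sum_k |\psi_{lk}|\|_\infty \lesssim 2^{l/2}$, see below, one obtains

\[ \|f_0^{L_n^c}\|_\infty \le \sum_{l > L_n} \Big[\max_k |f_{0,lk}|\Big]\, \Big\|\sum_k |\psi_{lk}|\Big\|_\infty \lesssim h_n^\alpha \lesssim \varepsilon^*_{n,\alpha}. \]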

The term (iii) depends on the randomness of the observations only, namely,

\[ (iii) = \|\hat f^{L_n} - f_0^{L_n}\|_\infty = \frac{1}{\sqrt n}\, \Big\|\sum_{l \le L_n,\, k} \varepsilon_{lk}\,\psi_{lk}(\cdot)\Big\|_\infty. \]

This is bounded under $E^n_{f_0}$ by a constant times $\varepsilon^*_{n,\alpha}$ with high probability, see Lemma 7 for a
proof in the -more difficult- case of empirical processes.

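Small $l$ ($l \le L_n$). By definition $\hat f^{L_n}$ has coordinates $\hat f_{lk}$ in the basis $\{\psi_{lk}\}$, so using the
localisation property of the wavelet basis as above, one obtains

\[ \|f^{L_n} - \hat f^{L_n}\|_\infty \lesssim \frac{1}{\sqrt n} \sum_{l \le L_n} 2^{l/2} \Big[\max_{0 \le k < 2^l} \sqrt n\, |f_{lk} - x_{lk}|\Big]. \]

For $t > 0$ and any $l \ge 0$, via Jensen's inequality and bounding the maximum by the sum,

\[ t\, E^n_{f_0} E^\Pi\Big[\max_{0 \le k < 2^l} \sqrt n\, |f_{lk} - x_{lk}| \,\Big|\, X^{(n)}\Big] \le \log\Big( \sum_{k=0}^{2^l - 1} E^n_{f_0} E^\Pi\big[ e^{t\sqrt n (f_{lk} - x_{lk})} + e^{-t\sqrt n (f_{lk} - x_{lk})} \,\big|\, X^{(n)} \big] \Big). \]

Simple computations presented in Lemma 1 yield a sub-Gaussian behaviour for the expected
Laplace transform of $\sqrt n (f_{lk} - x_{lk})$, which is bounded above by $C e^{t^2/2}$ for a constant $C$ independent
of $l, k$. From this deduce that, for any $t > 0$ and $l \le L_n$,

\[ E^n_{f_0} E^\Pi\Big[\max_{0 \le k < 2^l} \sqrt n\, |f_{lk} - x_{lk}| \,\Big|\, X^{(n)}\Big] \le \frac{\log(C 2^l)}{t} + \frac{t}{2}. \]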



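The choice $t = \sqrt{2\log(C 2^l)}$ leads us to control the expectation of the term (i) above by

\[ \frac{1}{\sqrt n} \sum_{l \le L_n} 2^{l/2} \sqrt{2\log(C 2^l)} \lesssim \sqrt{\frac{L_n 2^{L_n}}{n}} \lesssim \varepsilon^*_{n,\alpha}. \]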
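The maximal inequality used for the small-$l$ regime can be checked numerically in the simplest situation, where the variables are exactly standard normal (so that the Laplace transform bound holds with $C = 1$). The sketch below is an illustration under that assumption, not the posterior computation of the text: it keeps the factor 2 coming from the two exponentials explicit, so the optimised bound is $\sqrt{2\log(2N)}$ for $N$ variables. The function names are hypothetical.

```python
import math
import random

def max_bound(num_terms: int, C: float = 1.0) -> float:
    """Bound log(2*C*N)/t + t/2 evaluated at t = sqrt(2*log(2*C*N))."""
    log_term = math.log(2 * C * num_terms)
    t = math.sqrt(2 * log_term)
    return log_term / t + t / 2          # algebraically equal to t

def mc_expected_max(num_terms: int, reps: int, rng: random.Random) -> float:
    """Monte Carlo estimate of E max_k |Z_k| for i.i.d. standard normal Z_k."""
    total = 0.0
    for _ in range(reps):
        total += max(abs(rng.gauss(0.0, 1.0)) for _ in range(num_terms))
    return total / reps

N = 2 ** 4                                # number of coefficients at level l = 4
bound = max_bound(N)
estimate = mc_expected_max(N, 2000, random.Random(0))
```

The estimate stays below the bound, which grows like $\sqrt{l}$ at level $l$; this is the source of the logarithmic factor in the sup-norm rate.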



Large $l$ ($l > L_n$). Under the considered prior, which is bounded, using again the localisation
property of the wavelet basis, the term (ii) above is bounded in expectation by

\[ \sum_{l > L_n} 2^{l/2}\, E^n_{f_0} E^\Pi\Big[\max_k |f_{lk}| \,\Big|\, X^{(n)}\Big] \lesssim \sum_{l > L_n} 2^{l/2} \sigma_l \lesssim h_n^\alpha = \varepsilon^*_{n,\alpha}. \]

We have obtained the following proposition.

Proposition 1. Consider observations $X^{(n)}$ from the model (4). Let $\alpha$ and $f_0$ satisfy (5) and let
the prior be chosen according to (6). Then there exists $M > 0$ such that for $\varepsilon^*_{n,\alpha}$ defined by (2),

\[ E^n_{f_0} \int \|f - f_0\|_\infty\, d\Pi(f \mid X^{(n)}) \le M \varepsilon^*_{n,\alpha}. \]
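In the setting of Proposition 1 the posterior is explicit coordinate by coordinate: the likelihood of $x_{lk}$ is Gaussian in $f_{lk}$ and the prior is uniform on $[-B\sigma_l, B\sigma_l]$, so the posterior of $f_{lk}$ is the $N(x_{lk}, 1/n)$ law restricted to that interval. The following is a minimal sketch of a sampler for one coordinate by inverse-CDF sampling; the function name and the numerical values are hypothetical, and the fallback used when the interval carries almost no Gaussian mass is a crude approximation.

```python
import math
import random
from statistics import NormalDist

def sample_posterior_coefficient(x, n, sigma, B, rng):
    """Draw f_lk from N(x, 1/n) truncated to [-B*sigma, B*sigma]."""
    post = NormalDist(mu=x, sigma=1.0 / math.sqrt(n))
    lo, hi = post.cdf(-B * sigma), post.cdf(B * sigma)
    if hi - lo < 1e-12:
        # Almost no Gaussian mass in the interval: approximate by the nearest endpoint.
        return max(-B * sigma, min(B * sigma, x))
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-15), 1.0 - 1e-15)           # inv_cdf needs 0 < u < 1
    draw = post.inv_cdf(u)
    return max(-B * sigma, min(B * sigma, draw))  # guard against rounding

rng = random.Random(0)
alpha, R, B, n, l = 1.0, 1.0, 2.0, 10_000, 3
sigma = 2.0 ** (-l * (0.5 + alpha))               # scaling sigma_l from (6)
f0 = 0.5 * R * sigma                              # a true coefficient satisfying (5)
x = f0 + rng.gauss(0.0, 1.0) / math.sqrt(n)       # one projected observation
draws = [sample_posterior_coefficient(x, n, sigma, B, rng) for _ in range(2000)]
```

For low levels the truncation is far from $x_{lk}$ and the draws behave like $x_{lk}$ plus Gaussian noise of size $1/\sqrt n$; for high levels the interval shrinks and the prior dominates.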

Although fairly simple, the previous example is revealing of some important facts, some of
which are well-known from frequentist analysis of the problem, some being specific to the Bayesian
approach. The previous proof shows two regimes of frequencies: $l \le L_n$ 'low frequency' and $l > L_n$
'high frequency'. In the low frequency regime, the estimator $x_{lk}$ of $f_{lk} = \langle f, \psi_{lk}\rangle_2$ is satisfactory,
and the concentration of the posterior distribution around this efficient frequentist estimator is
desirable. This is reminiscent of the Bernstein-von Mises (BvM) property, see van der Vaart
[22], Chap. 10, which states that in regular parametric problems with unknown parameter $\theta$,
the posterior distribution is asymptotically Gaussian, concentrating at rate $1/\sqrt n$ and centered
around an efficient estimator of $\theta$. Here are a few words on the general philosophy of the results,
specifically in the Bayesian context. Such a method was used as a building block in [4]. The idea
is to split the distance of interest into small pieces. For the sup-norm those pieces can for instance
be the coefficients $\langle f, \psi_{lk}\rangle_2$ -but not necessarily, as will be seen for log-density priors-. Next one
analyses each piece separately, with different regimes of indexes $l, k$ often arising, requiring specific
techniques for each of them.

• the BvM-regime: semi-parametric bias. For 'low frequencies', what is typically needed is a
concentration of the posterior distribution for the functional of interest at rate $1/\sqrt n$ around
a semiparametrically efficient estimator of the functional. This is at the heart of the proof of
semiparametric BvM results, hence the use of BvM techniques. In particular, sharp control of
the bias will be essential. Regarding the BvM property, although the precise Gaussian shape
will not be needed here, one needs uniformity in all frequencies in the considered regime.
This requires non-trivial strengthenings of BvM-type results, the semiparametric influence
function of the functional of interest, which can be for instance a recentered version of $\psi_{lk}$,
being typically unbounded as $l$ grows.

• Taking care of uniformity issues in the approximation of the efficient influence functions by the
prior may require various approximation regimes depending on $l$. For log-density priors, we
will indeed see various regimes of indexes $l$ arise in the obtained bounds for the bias.

• The high-frequency bias corresponds to frequencies where the prior should make the likelihood
negligible. This part can be difficult to handle too, especially for unbounded priors.



In the example above for uniform priors in white noise, most of the previous steps are either almost
trivial or at least tractable by considering the explicit expression of the posterior, but for different
priors or in different sampling situations some of the previous steps may become significantly
harder, as we will see below.

2.2 Wavelet basis and Besov spaces 

Central to our investigations is the tool provided by localised bases of $L^2[0,1]$. We refer to the
Lecture Notes by Härdle, Kerkyacharian, Picard and Tsybakov [11] for an introduction to wavelets.
Two bases will be used in the sequel.

The Haar basis on $[0,1]$ is defined by $\varphi^H(x) = 1$, $\psi^H(x) := \psi^H_{0,0}(x) = -1_{[0,1/2]}(x) + 1_{(1/2,1]}(x)$
and $\psi^H_{lk}(x) = 2^{l/2}\psi^H(2^l x - k)$, for any integer $l \ge 0$ and $0 \le k \le 2^l - 1$. The supports of Haar wavelets
form dyadic partitions of $[0,1]$, corresponding to intervals $I^l_k := (k2^{-l}, (k+1)2^{-l}]$ for $k > 0$, and
where the interval is closed to the left when $k = 0$.
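The Haar definitions above translate directly into code. The following minimal sketch evaluates $\psi^H_{lk}$ and checks orthonormality with a midpoint rule on a dyadic grid, which is exact (up to rounding) for functions that are constant on the grid cells. The helper names `haar`, `inner` and `level_abs_sum` are hypothetical.

```python
import math

def haar(l: int, k: int, x: float) -> float:
    """Haar wavelet psi^H_{lk}(x) = 2^{l/2} psi^H(2^l x - k) on [0, 1]."""
    y = (2 ** l) * x - k
    if y < 0 or y > 1 or (y == 0 and k > 0):      # supports are (k 2^-l, (k+1) 2^-l]
        return 0.0
    mother = -1.0 if y <= 0.5 else 1.0            # -1 on [0, 1/2], +1 on (1/2, 1]
    return 2 ** (l / 2) * mother

def inner(u, v, J: int = 8) -> float:
    """Midpoint-rule L^2 inner product on 2^J dyadic cells."""
    m = 2 ** J
    return sum(u((i + 0.5) / m) * v((i + 0.5) / m) for i in range(m)) / m

def level_abs_sum(l: int, x: float) -> float:
    """Value at x of the sum over k of |psi^H_{lk}|."""
    return sum(abs(haar(l, k, x)) for k in range(2 ** l))

basis = [(l, k) for l in range(4) for k in range(2 ** l)]
```

Since Haar supports at a given level are disjoint, `level_abs_sum(l, x)` equals $2^{l/2}$, which is the Haar instance of the localisation property $\|\sum_k |\psi_{lk}|\|_\infty \lesssim 2^{l/2}$ used in the prologue.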

The boundary corrected basis of Cohen, Daubechies, Vial [6] will be referred to as CDV basis.
As for the Haar basis, this basis enables a treatment on compact intervals, but at the same time
can be chosen sufficiently smooth. A few properties are lost, essentially simple explicit expressions,
but most convenient localisation properties and characterisation of spaces are maintained. Below
we give the properties of the CDV basis that are used in the sequel. We denote this basis $\{\psi_{lk}\}$,
with indexes $l \ge 0$, $0 \le k \le 2^l - 1$ (with respect to the original construction in [6], one starts at
a sufficiently large level $l \ge J$, with $J$ fixed large enough; for simplicity, up to renumbering, one
can start the indexing at $l = 0$). Let $\alpha > 0$ be fixed.

• $\{\psi_{lk}\}$ forms an orthonormal basis of $L^2[0,1]$.

• $\psi_{lk}$ have support $S_{lk}$, with diameter at most a constant (independent of $l, k$) times $2^{-l}$, and
$\|\psi_{lk}\|_\infty \lesssim 2^{l/2}$. The $\psi_{lk}$'s are in the Hölder class $C^S[0,1]$, for some $S \ge \alpha$.

• At fixed level $l$, given a fixed $\psi_{lk}$ with support $S_{lk}$,

  ◦ the number of wavelets of the level $l' \le l$ with support intersecting $S_{lk}$ is bounded by
  a fixed constant (independent of $l', l, k$)

  ◦ the number of wavelets of the level $l' > l$ with support intersecting $S_{lk}$ is bounded by
  $2^{l'-l}$ times a fixed constant (independent of $l', l, k$).

  The following localisation property holds: $\|\sum_{k=0}^{2^l-1} |\psi_{lk}|\|_\infty \lesssim 2^{l/2}$, where the inequality is up
  to a fixed universal constant, independent of $l$.

• The constant function equal to 1 on $[0,1]$ is orthogonal to high-level wavelets, in the sense
that $\langle\psi_{lk}, 1\rangle_2 = \int_0^1 \psi_{lk} = 0$ whenever $l \ge M$, for a large enough constant $M$.

• The basis $\{\psi_{lk}\}$ characterises Besov spaces $B^s_{\infty,\infty}[0,1]$, any $s \le \alpha$, in terms of wavelet
coefficients. That is, $g \in B^s_{\infty,\infty}[0,1]$ if and only if

\[ \|g\|_{\infty,\infty,s} := \max_{l \ge 0,\ 0 \le k \le 2^l-1} 2^{l(\frac12+s)} |\langle g, \psi_{lk}\rangle_2| < \infty. \]

Finally, recall that $B^s_{\infty,\infty}$ coincides with the Hölder space $C^s$ when $s$ is not an integer and that
when $s$ is an integer the inclusion $C^s \subset B^s_{\infty,\infty}$ holds.
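The weighted coefficient norm above can be explored numerically. The sketch below is only an illustration: it uses the Haar basis as a stand-in for the CDV basis (an assumption, since the characterisation in the text is stated for CDV wavelets) and the test function $g(x) = x$, whose Haar coefficients are $2^{-3l/2}/4$ for every $k$. The function names are hypothetical.

```python
import math

def haar_coefficient(g, l: int, k: int, J: int = 12) -> float:
    """Approximate <g, psi^H_{lk}>_2 by the midpoint rule on 2^J dyadic cells."""
    m = 2 ** J
    total = 0.0
    # Only the cells inside the support [k 2^-l, (k+1) 2^-l] contribute.
    for i in range(k * m // 2 ** l, (k + 1) * m // 2 ** l):
        x = (i + 0.5) / m
        sign = -1.0 if (2 ** l) * x - k <= 0.5 else 1.0
        total += sign * g(x)
    return 2 ** (l / 2) * total / m

def weighted_sup(g, s: float, max_level: int) -> float:
    """Max over l <= max_level and k of 2^{l(1/2+s)} |<g, psi^H_{lk}>_2|."""
    return max(
        2 ** (l * (0.5 + s)) * abs(haar_coefficient(g, l, k))
        for l in range(max_level + 1)
        for k in range(2 ** l)
    )

identity = lambda x: x
```

For $g(x) = x$ the weighted maximum equals $2^{l(s-1)}/4$ at level $l$: it stays equal to $1/4$ for $s = 1$ and grows with the level for $s > 1$, in line with $g$ being Lipschitz but the Haar coefficients not detecting any higher smoothness.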



3 Main results 

3.1 Gaussian white noise 

Consider priors $\Pi$ defined as coordinatewise products of priors on coordinates specified by a density
$\varphi$ and scalings $\{\sigma_l\}$ as in Section 2.1. The next result allows for a much broader class of priors.

Let $\varphi$ be a continuous density with respect to Lebesgue measure on $\mathbb{R}$. We assume that $\varphi$ is
(strictly) positive on $[-1,1]$ and that it satisfies

\[ \exists\, b_1, b_2, c_1, c_2, \delta > 0,\ \forall x : |x| \ge 1, \qquad c_1 e^{-b_1|x|^{1+\delta}} \le \varphi(x) \le c_2 e^{-b_2|x|^{1+\delta}}. \tag{7} \]

Consider a scaling $\sigma_l$ for the prior equal to, for $\delta$ the constant in (7),

\[ \sigma_l = \frac{2^{-l(\frac12+\alpha)}}{(l+1)^{\mu}}, \qquad \mu = \frac{1}{1+\delta}. \tag{8} \]

Theorem 1. Let $X^{(n)}$ be observations from (4). Suppose $f_0$ belongs to $B^\alpha_{\infty,\infty}[0,1]$, for some
$\alpha > 0$. Let the prior $\Pi$ be a product prior defined through $\varphi$ and $\sigma_l$ satisfying (7), (8). Then there
exists $M > 0$ such that for $\varepsilon^*_{n,\alpha}$ defined by (2),

\[ E^n_{f_0} \int \|f - f_0\|_\infty\, d\Pi(f \mid X^{(n)}) \le M \varepsilon^*_{n,\alpha}. \]



Theorem 1 can be seen as a generalisation to non-conjugate priors of Theorem 1 in [10].
Conditions on $\varphi$ are mostly for simplicity of presentation, but similar results can be obtained for
a variety of related priors, e.g. uniform priors, Laplace priors (up to a logarithmic factor in the
rate) etc. The reason for some tail control in (7) is to handle priors without high-frequency cut-off.
This way, the results are obtained for canonical priors, in that they do not depend on $n$. Results
for priors with cut-off $l \le K_n$, for some $K_n \to \infty$, can be obtained along the same lines.
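As an illustration of the product priors of this subsection, here is a minimal sketch that draws coefficients $f_{lk} = \sigma_l \xi_{lk}$. It assumes the scaling (8) reads $\sigma_l = 2^{-l(1/2+\alpha)}/(l+1)^\mu$ with $\mu = 1/(1+\delta)$, and it takes $\varphi$ to be the standard Gaussian density, which satisfies the tail condition (7) with $\delta = 1$. The Gaussian choice and the truncation at `max_level` are conveniences of the example, not requirements of the theorem; the function names are hypothetical.

```python
import math
import random

def sigma(l: int, alpha: float, delta: float) -> float:
    """Scaling sigma_l = 2^{-l(1/2+alpha)} / (l+1)^mu with mu = 1/(1+delta)."""
    mu = 1.0 / (1.0 + delta)
    return 2.0 ** (-l * (0.5 + alpha)) / (l + 1) ** mu

def draw_prior_coefficients(max_level: int, alpha: float, rng: random.Random,
                            delta: float = 1.0) -> dict:
    """Draw f_lk = sigma_l * xi_lk, xi_lk i.i.d. N(0, 1), for levels l <= max_level."""
    return {
        (l, k): sigma(l, alpha, delta) * rng.gauss(0.0, 1.0)
        for l in range(max_level + 1)
        for k in range(2 ** l)
    }

coeffs = draw_prior_coefficients(5, alpha=1.0, rng=random.Random(0))
```

The extra factor $(l+1)^{-\mu}$ makes the prior draws slightly smoother than the bound (5) requires, which is what the tail control in (7) is calibrated against.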

3.2 Density estimation 

Consider independent and identically distributed observations

\[ X^{(n)} = (X_1, \ldots, X_n), \tag{9} \]

with unknown density function $f$ on $[0,1]$. We use the same notation $X^{(n)}$ for observations as in the
white noise model: it will always be clear from the context which model we are referring to. Let $\mathcal{F}$
be the set of all densities $f$ on $[0,1]$ for which there exist constants $\rho, D$ with $0 < \rho \le f \le D < \infty$.

3.2.1 Log-densities priors 

Define the prior $\Pi$ on densities as follows. Given a sufficiently smooth CDV-wavelet basis $\{\psi_{lk}\}$,
consider the prior induced by, for any $x \in [0,1]$ and $L_n$ defined in (3),

\[ T(x) = \sum_{l=0}^{L_n} \sum_{k=0}^{2^l-1} \sigma_{lk}\,\alpha_{lk}\,\psi_{lk}(x), \tag{10} \]

\[ f(x) = \exp\{T(x) - c(T)\}, \qquad c(T) = \log \int_0^1 e^{T(x)}\,dx, \tag{11} \]

where $\alpha_{lk}$ are i.i.d. random variables of density $\varphi$ with respect to Lebesgue measure on $\mathbb{R}$ and $\sigma_{lk}$
are positive reals which for simplicity we make only depend on $l$, that is $\sigma_{lk} = \sigma_l$. We consider
the choices $\varphi(x) = \varphi_G(x) = e^{-x^2/2}/\sqrt{2\pi}$, the Gaussian density, and $\varphi(x) = \varphi_H(x)$, where $\varphi_H$ is
any density such that $\log\varphi_H$ is Lipschitz on $\mathbb{R}$ -we call this the 'Heavy-Tails case'-. For instance, the
$\alpha_{lk}$'s can be Laplace-distributed or have heavier tails, such as, for a given $0 \le \tau < 1$ and $x \in \mathbb{R}$,
and $c_\tau$ a normalising constant,

\[ \varphi_{H,\tau}(x) = c_\tau \exp\{-(1+|x|)^{1-\tau}\}. \tag{12} \]

Suppose the prior parameters $\sigma_l$ satisfy, for some $\alpha > 1/2$ and $0 < r \le \alpha$,

\[ \sigma_l \ge 2^{-l(\alpha+\frac12)} \ \text{(Heavy Tails-case)}, \qquad \sigma_l = 2^{-l(\frac12+r)} \ \text{(Gaussian-case)}. \tag{13} \]

Typically, see examples below, such priors $f$ in (11) under $\varphi = \varphi_G$ or $\varphi_H$ and (13) attain the rate
$\varepsilon_{n,\alpha}$ in (1) in terms of Hellinger loss, up to logarithmic terms. For some $\nu > 0$, suppose

\[ E^n_{f_0} \Pi[f : h(f, f_0) > (\log n)^\nu \varepsilon_{n,\alpha} \mid X^{(n)}] \to 0. \tag{14} \]

If (14) holds for some $\nu > 0$, we denote $\varepsilon_n := (\log n)^\nu \varepsilon_{n,\alpha}$, and $\zeta_n := \varepsilon_n 2^{L_n/2}$, with $L_n$ as in (3).

Theorem 2. Consider observations $X^{(n)}$ from model (9). Suppose $\log f_0$ belongs to $C^\alpha[0,1]$, with
$\alpha \ge 1$. Let $\Pi$ be the prior on $\mathcal{F}$ defined by (11), with $\varphi = \varphi_G$ or $\varphi_H$. Suppose that $\sigma_l$ satisfy (13)
and that (14) holds. Then, for $\alpha > 1$ and $\varepsilon^*_{n,\alpha}$ defined by (2), any $M_n \to \infty$, it holds, as $n \to \infty$,

\[ E^n_{f_0} \Pi[f : \|f - f_0\|_\infty > M_n \varepsilon^*_{n,\alpha} \mid X^{(n)}] \to 0. \]

In the case $\alpha = 1$, the same holds with $\varepsilon^*_{n,\alpha}$ replaced by $(\log n)^\eta \varepsilon^*_{n,\alpha}$, for some $\eta > 0$.

Theorem 2 implies that log-density priors for many natural priors on the coefficients achieve
the precise optimal minimax rate of estimation over Hölder spaces under sup-norm loss, as soon
as the regularity is at least 1.

In the case 1/2 < a < 1, examination of the proof reveals that the presented techniques 
provide the sup-norm rate p n = n^2^ 1 f-)/( 1 + 2Q ) U p to logarithmic terms. For 1/2 < a < 1, 
we have $\varepsilon^*_{n,\alpha} \ll \rho_n \ll \zeta_n$. So, although the minimax rate is not exactly attained for those low
regularities, the obtained rate improves on the intermediate rate $\zeta_n$ which was obtained in [10]
for slightly different priors. In the next subsection a prior is proposed which attains the minimax
rate for the sup-norm in the case $1/2 < \alpha \le 1$.

Let us give some examples of prior distributions satisfying the assumptions of Theorem 2. In
the Gaussian case, any sequence of the type $\sigma_l = 2^{-l(\frac12+\gamma)}$ with $0 < \gamma \le \alpha$ satisfies both (13) and
(14). In the heavy-tails case, the choice $\varphi = \varphi_{H,\tau}$ in (12) with any $0 \le \tau < 1$ combined with
$\sigma_l = 2^{-l\alpha}$ satisfies (13)-(14). Both claims follow from minor adaptations of Theorem 4.5 in [23]
and Theorem 2.1 in [14] respectively, see Lemma 8. In both Gaussian and Heavy Tails cases, we
in fact expect (14) to hold true for many other choices of $\sigma_l$ under (13) and $\log\varphi_H$ Lipschitz, or
under $\sigma_l \ge 2^{-l(\frac12+\alpha)}$ in the Gaussian case, although such a general statement in Hellinger distance
is not yet available in the literature, to the best of our knowledge.
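To visualise what a draw from the log-density prior (10)-(11) looks like, the following minimal sketch samples $T$ in the Gaussian case with $\sigma_l = 2^{-l(1/2+\gamma)}$ and exponentiates it. Two simplifications are assumptions of the example: the Haar basis replaces the smooth CDV basis required in the text, and $c(T)$ is computed by a midpoint rule on a dyadic grid (which is exact for Haar expansions when the grid is fine enough). The function names are hypothetical.

```python
import math
import random

def haar(l: int, k: int, x: float) -> float:
    """Haar wavelet 2^{l/2} psi^H(2^l x - k), used here as a stand-in basis."""
    y = (2 ** l) * x - k
    if y < 0 or y > 1:
        return 0.0
    return 2 ** (l / 2) * (-1.0 if y <= 0.5 else 1.0)

def draw_log_density(L: int, gamma: float, rng: random.Random, grid_size: int = 1024):
    """Return grid midpoints and values of one prior draw f = exp(T - c(T))."""
    coeffs = {
        (l, k): 2.0 ** (-l * (0.5 + gamma)) * rng.gauss(0.0, 1.0)
        for l in range(L + 1)
        for k in range(2 ** l)
    }
    xs = [(i + 0.5) / grid_size for i in range(grid_size)]
    T = [sum(c * haar(l, k, x) for (l, k), c in coeffs.items()) for x in xs]
    c_T = math.log(sum(math.exp(t) for t in T) / grid_size)  # c(T) = log of int exp(T)
    return xs, [math.exp(t - c_T) for t in T]

xs, f_values = draw_log_density(L=4, gamma=1.0, rng=random.Random(0))
```

By construction every draw is a strictly positive density, which is the main practical appeal of working on the logarithmic scale.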

3.2.2 Random dyadic histograms 

Associated to the regular dyadic partition of $[0,1]$ at level $L \in \mathbb{N}^*$, given by $I^L_0 = [0, 2^{-L}]$ and
$I^L_k = (k2^{-L}, (k+1)2^{-L}]$ for $k = 1, \ldots, 2^L - 1$, is a natural notion of histogram

\[ \mathcal{H}_L = \Big\{ h \in L^\infty[0,1],\ h(x) = \sum_{k=0}^{2^L-1} h_k 1_{I^L_k}(x),\ h_k \in \mathbb{R},\ k = 0, \ldots, 2^L - 1 \Big\}, \]

the set of all histograms with $2^L$ regular bins on $[0,1]$. Let $\mathcal{S}_L = \{\omega \in [0,1]^{2^L};\ \sum_{k=0}^{2^L-1} \omega_k = 1\}$ be
the unit simplex in $\mathbb{R}^{2^L}$. Further denote

\[ \mathcal{H}^1_L = \Big\{ f \in L^\infty[0,1],\ f(x) = 2^L \sum_{k=0}^{2^L-1} \omega_k 1_{I^L_k}(x),\ (\omega_0, \ldots, \omega_{2^L-1}) \in \mathcal{S}_L \Big\}. \]

The set $\mathcal{H}^1_L$ is the subset of $\mathcal{H}_L$ consisting of histograms which are densities on $[0,1]$. Let $\mathcal{H}^1$ be
the set of all histograms which are densities on $[0,1]$.

A simple way to specify a prior on $\mathcal{H}^1_L$ is to set $L = L_n$ deterministic and to fix a distribution
for $\omega_L := (\omega_0, \ldots, \omega_{2^L-1})$. Set $L = L_n$ as defined in (3). Choose some fixed constants $a, c_1, c_2 > 0$
and let

\[ L = L_n, \qquad \omega_L \sim \mathcal{D}(\alpha_0, \ldots, \alpha_{2^L-1}), \qquad c_1 2^{-La} \le \alpha_k \le c_2, \tag{15} \]

for any admissible index $k$, where $\mathcal{D}$ denotes the Dirichlet distribution on $\mathcal{S}_L$. Unlike suggested
by the notation, the coefficients $\alpha$ of the Dirichlet distribution are allowed to depend on $L_n$, so
that $\alpha_k = \alpha_{k,L_n}$.

Theorem 3. Let $f_0$ belong to $C^\alpha[0,1]$, where $1/2 < \alpha \le 1$. Let $\Pi$ be the prior on $\mathcal{H}^1 \subset \mathcal{F}$ defined
by (15). Then, for $\varepsilon^*_{n,\alpha}$ defined by (2) and any $M_n \to \infty$ it holds, as $n \to \infty$,

\[ E^n_{f_0} \Pi[f : \|f - f_0\|_\infty > M_n \varepsilon^*_{n,\alpha} \mid X^{(n)}] \to 0. \]

According to Theorem 3, random dyadic histograms achieve the precise minimax rate in
sup-norm over Hölder balls. Condition (15) is quite mild. For instance, the uniform choice
$\alpha_0 = \cdots = \alpha_{2^L-1} = 1$ is allowed, as well as a variety of others: for instance one can take $\alpha_k = \alpha_{k,L_n}$
to originate from a measure $A = A_{L_n}$ on the interval $[0,1]$, of finite total mass $\bar A_{L_n} := A([0,1])$.
By this we mean $\alpha_k = A(I^{L_n}_k)$. If $A/\bar A_{L_n}$ has say a fixed continuous and positive density $a$ with
respect to Lebesgue measure on $[0,1]$, then (15) is satisfied as soon as there exists a $\delta > 0$ with
$2^{-\delta L_n} \le \bar A_{L_n} \le 2^{L_n}$.
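The histogram prior (15) is convenient computationally. The sketch below relies on the standard Dirichlet-multinomial conjugacy (a textbook fact, not a statement taken from this paper): if $\omega \sim \mathcal{D}(\alpha_0, \ldots, \alpha_{2^L-1})$ and $N_k$ observations fall in bin $I^L_k$, the posterior of $\omega$ is $\mathcal{D}(\alpha_0 + N_0, \ldots, \alpha_{2^L-1} + N_{2^L-1})$. The function names and the simulated uniform data are hypothetical.

```python
import math
import random

def draw_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalising independent Gamma(alpha_k, 1) variables."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def bin_counts(data, L):
    """Counts N_k of observations in the 2^L regular dyadic bins of [0, 1]."""
    m = 2 ** L
    counts = [0] * m
    for x in data:
        # Points exactly on a bin boundary go to the right-hand bin here;
        # this differs from the half-open convention only on a null set.
        counts[min(int(x * m), m - 1)] += 1
    return counts

def draw_posterior_histogram(data, L, alphas, rng):
    """One posterior draw of the histogram density, returned as bin heights 2^L * omega_k."""
    counts = bin_counts(data, L)
    weights = draw_dirichlet([a + c for a, c in zip(alphas, counts)], rng)
    return [2 ** L * w for w in weights]

rng = random.Random(1)
data = [rng.random() for _ in range(500)]          # simulated data with f0 = 1
L = 3
heights = draw_posterior_histogram(data, L, [1.0] * 2 ** L, rng)
```

In the theorem the level is $L = L_n$ from (3); a fixed small `L` is used above only to keep the example short.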

3.3 Discussion 

We have introduced a methodology which allows one to obtain optimal minimax rates of contraction
in strong distances in density estimation. The essence of the technique is to view the problem
semiparametrically as the uniform study of a collection of semiparametric Bayes concentration
results, very much in the spirit of nonparametric Bernstein-von Mises results as studied in [4].
For the sake of clarity we refrain from carrying out further extensions in the present paper but
one can note that the impact of the techniques goes well beyond this. Let us mention a
few examples. From the sup-norm rates, optimal results -up to logarithmic terms- in $L^q$-metrics,
$q \ge 2$, can be immediately obtained by interpolation. Adaptation to the unknown $\alpha$ could also
be considered. However, note that 'fixed $\alpha$' nonparametric results as such are already very
desirable in strong norms. They can for instance be used in the study of remainder terms of
semiparametric functional expansions or of LAN-expansions, e.g. to check the conditions of
application of semiparametric Bernstein-von Mises theorems as in [2]. In this semiparametric
perspective, adaptation to $f$ is in fact not always desirable, since posteriors for functionals may
behave pathologically when an adaptive prior on the nuisance is chosen, see [14] and [5]. Also, we
expect the present methodology to give results in a broad variety of statistical models and/or for
different classes of priors. Indeed, it reduces the problem of the strong-distance rate to two parts:
1) uniform semiparametric study of functionals and 2) high-frequency bias. The first part is very
much related to obtaining (uniform) semiparametric Bernstein-von Mises (BvM) results. So, any
advance in BvM theory for classes of priors will automatically lead to advances in 1). As for 2),
although ad-hoc methods can be used depending on the model and prior at hand (e.g. choosing
priors cutting high-frequencies or applying a concentration of measure inequality), the example
of log-density priors indicates that a pre-processing step may be useful in general to 'adapt' the
problem first to the 'geometry' of the prior.



4 Proofs 



4.1 Gaussian white noise 



Lemma 1. Let $X^{(n)}$ follow model (4). Let $f_0$ satisfy (5) and let the prior $\Pi$ be chosen according
to (6). There exists $C > 0$ such that for any real $t$ and $l \le L_n$, with $L_n$ defined in (3),

\[ E^n_{f_0} E^\Pi\big[ e^{t\sqrt n (f_{lk} - x_{lk})} \mid X^{(n)} \big] \le C e^{t^2/2}. \]

Proof of Lemma 1. The proof is similar to the first lines of the proof of Theorem 5 in [4]: one uses
Bayes' formula to express the posterior expectation in the Lemma. Next, using (5), one finds a
universal $L_0 > 0$ such that for any $l, k$ and $v \in (-L_0, L_0)$, the expression involving the density $\varphi$
in the next line is bounded below (and also above) by a fixed constant, and thus can be removed
from the expression, leading to

\[
\begin{aligned}
E^\Pi\big[ e^{t\sqrt n (f_{lk} - x_{lk})} \mid X^{(n)} \big]
&= e^{-t\varepsilon_{lk}}\, \frac{\int e^{tv - v^2/2 + \varepsilon_{lk} v}\, \varphi\big(\frac{f_{0,lk} + v/\sqrt n}{\sigma_l}\big)\, dv}{\int e^{-v^2/2 + \varepsilon_{lk} v}\, \varphi\big(\frac{f_{0,lk} + v/\sqrt n}{\sigma_l}\big)\, dv} \\
&\lesssim e^{-t\varepsilon_{lk}}\, \frac{\int e^{tv - (v - \varepsilon_{lk})^2/2}\, dv}{\int_{-L_0}^{L_0} e^{-(v - \varepsilon_{lk})^2/2}\, dv}
\lesssim \frac{\int e^{tu - u^2/2}\, du}{\int_{-L_0}^{L_0} e^{-(v - \varepsilon_{lk})^2/2}\, dv} \\
&\lesssim e^{t^2/2} \Big[ \int_{-L_0}^{L_0} e^{-(v - \varepsilon_{lk})^2/2}\, dv \Big]^{-1}.
\end{aligned}
\]

Since $\varepsilon_{lk}$ are standard normal, simple calculations show that the expectation of the inverse of the
quantity under brackets is bounded by a universal constant, as in [4] page 19. □

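Proof of Theorem 1. Small $l$. Let us first consider indexes $l$ with $l \le L_n$. For any real $t$, set
$Q_{lk}(t) := E^n_{f_0} E^\Pi[ e^{t\sqrt n (f_{lk} - x_{lk})} \mid X^{(n)} ]$. Using the fact that $\varphi$ is bounded,

\[ Q_{lk}(t) \lesssim E^n_{f_0} \left[ \frac{\int e^{t(v - \varepsilon_{lk}) - (v - \varepsilon_{lk})^2/2}\, dv}{\int e^{-(v - \varepsilon_{lk})^2/2}\, \varphi\big(\frac{f_{0,lk} + v/\sqrt n}{\sigma_l}\big)\, dv} \right]. \]

Introduce the set, for any possibly $l$-dependent sequence $M_l$,

\[ A(M_l) = \Big\{ v : \frac{|f_{0,lk} + v/\sqrt n|}{\sigma_l} \le M_l \Big\}. \tag{16} \]

Choose $M_l = C(l+1)^\mu$ with $\mu$ as in (8). In particular, along with (8) this implies that for this
choice of $M_l$ and $C$ large enough, $A(M_l)$ contains the interval $(-1,1)$. First restricting the integral
on the denominator to $(-1,1)$ and next using the tail condition on $\varphi$ (and the fact that $\varphi \ge c_\varphi > 0$
on $(-1,1)$), one gets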



U{t)<El- 



c a 



< «V+i. 



- l I\ 



e 2 dv 



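The maximal inequality argument from Section 2.1 immediately yields (i) $\lesssim \varepsilon^*_{n,\alpha}$.

Large $l$. Let us now consider the case $l > L_n$. For any real $t$, write

\[ E^n_{f_0} E^\Pi[ e^{t f_{lk}} \mid X^{(n)} ] = E^n_{f_0}\, \frac{\int e^{t(f_{0,lk} + \frac{v}{\sqrt n}) - \frac{v^2}{2} + \varepsilon_{lk} v}\, \frac{1}{\sqrt n \sigma_l}\, \varphi\big(\frac{f_{0,lk} + v/\sqrt n}{\sigma_l}\big)\, dv}{\int e^{-\frac{v^2}{2} + \varepsilon_{lk} v}\, \frac{1}{\sqrt n \sigma_l}\, \varphi\big(\frac{f_{0,lk} + v/\sqrt n}{\sigma_l}\big)\, dv} =: E^n_{f_0}\, \frac{N_{lk}(t)}{D_{lk}}. \]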

To bound the denominator, first restrict the integral to the set $A := A(1)$ (see (16)). Set

\[ \zeta_l = \int_A v\, \frac{1}{\sqrt n \sigma_l}\, \varphi\Big(\frac{f_{0,lk} + v/\sqrt n}{\sigma_l}\Big)\, dv, \]

next apply Jensen's inequality with the logarithm function to get, with $|A|$ the diameter of $A$,

\[ \log D_{lk} \ge -\frac{|A|\,\|\varphi\|_\infty}{\sqrt n \sigma_l}\, \sup_{v \in A} \frac{v^2}{2} + \varepsilon_{lk}\zeta_l \ge -Cn(\sigma_l^2 + f_{0,lk}^2) + \varepsilon_{lk}\zeta_l, \]

where we have used that $M_l = 1$ in (16). The following bound will be useful in a moment

\[ |\zeta_l| \le \frac{|A|\,\|\varphi\|_\infty}{\sqrt n \sigma_l}\, \sup_{v \in A} |v| \lesssim \sqrt n\,(|f_{0,lk}| + \sigma_l). \]

To bound the numerator from above, split the integrating set into $A := A(1)$ and $A^c$ and write
$N_{lk}(t) =: N^{(1)}_{lk}(t) + N^{(2)}_{lk}(t)$ for the integrals over each respective set. It holds



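\[
\begin{aligned}
E^n_{f_0}\, \frac{N^{(1)}_{lk}(t)}{D_{lk}}
&\le e^{|t|\sigma_l + Cn(\sigma_l^2 + f_{0,lk}^2)}\, E^n_{f_0} \int_A e^{(v - \zeta_l)\varepsilon_{lk}}\, e^{-\frac{v^2}{2}}\, \frac{1}{\sqrt n \sigma_l}\, \varphi\Big(\frac{f_{0,lk} + v/\sqrt n}{\sigma_l}\Big)\, dv \\
&\le e^{|t|\sigma_l + Cn(\sigma_l^2 + f_{0,lk}^2)}\, \frac{|A|\,\|\varphi\|_\infty}{\sqrt n \sigma_l}\, \sup_{v \in A} e^{\frac{(v - \zeta_l)^2}{2}} \\
&\le e^{|t|\sigma_l + C'n(\sigma_l^2 + f_{0,lk}^2)}.
\end{aligned}
\]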

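On the other hand, the term over $A^c$ can be bounded as follows

\[
\begin{aligned}
E^n_{f_0}\, \frac{N^{(2)}_{lk}(t)}{D_{lk}}
&\le e^{Cn(\sigma_l^2 + f_{0,lk}^2)}\, E^n_{f_0} \int_{(-1,1)^c} e^{t\sigma_l w - \frac{n}{2}(w\sigma_l - f_{0,lk})^2 + \varepsilon_{lk}(\sqrt n (w\sigma_l - f_{0,lk}) - \zeta_l)}\, \varphi(w)\, dw \\
&\le e^{Cn(\sigma_l^2 + f_{0,lk}^2) + \frac{\zeta_l^2}{2}} \int_{(-1,1)^c} e^{t\sigma_l w - \sqrt n\, \zeta_l (\sigma_l w - f_{0,lk})}\, \varphi(w)\, dw \\
&\le e^{C''n(\sigma_l^2 + f_{0,lk}^2)} \int e^{(t\sigma_l + \sqrt n\, \sigma_l |\zeta_l|)|w|}\, \varphi(w)\, dw.
\end{aligned}
\]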



Using the tail behaviour of $\varphi$ leads to



£' 



(21 



/O £) 



/A- 



One deduces, using that for $l > L_n$ one has $n(\sigma_l^2 + f_{0,lk}^2) \lesssim n 2^{-l(1+2\alpha)} \lesssim \log n \lesssim l$, that
R lk (t) :=E%E n [m a x\f lk \\X^} < \{l + log(e to < + e C ^ t+ ^»^) 



t 

< - + °i + jWi(t + V^O)}— . 



— 1 . 

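Set $t = \sigma_l^{-1} l^{\frac{\delta}{1+\delta}}$ to deduce

\[ R_{lk}(t) \lesssim l^{\mu} \sigma_l \lesssim 2^{-l(\frac12+\alpha)} \]

and further obtain (ii) $\lesssim \sum_{l > L_n} 2^{l/2}\, 2^{-l(\frac12+\alpha)} \lesssim h_n^\alpha = \varepsilon^*_{n,\alpha}$. Therefore, for any $\delta > 0$, the rate is
precisely $\varepsilon^*_{n,\alpha}$. □

4.2 Density estimation, notation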

Given observations $X^{(n)}$ from (9), denote by $\ell_n(f)$ the log-likelihood $\ell_n(f) = \sum_{i=1}^n \log f(X_i)$,
well-defined for any density $f$ in the class $\mathcal{F}$. For any $u, v$ in $L^2(P_{f_0}) =: L^2(f_0)$, define the inner-product
$\langle\cdot,\cdot\rangle_L$ with associated norm $\|\cdot\|_L$, together with a stochastic term $W_n(u)$, as follows

\[ \langle u, v\rangle_L = \int (u - P_{f_0}u)(v - P_{f_0}v)\, f_0, \qquad W_n(u) = \frac{1}{\sqrt n} \sum_{i=1}^n [u(X_i) - P_{f_0}u]. \]

In particular, in empirical process notation $W_n(u) = \mathbb{G}_n(u)$. For any $f$ in $\mathcal{F}$, set $R_n(f, f_0) =
n P_{f_0}\log(f/f_0) + n\|\log(f/f_0)\|_L^2/2$. For any density $f \in \mathcal{F}$ it holds

\[ \ell_n(f) - \ell_n(f_0) = -\frac{n}{2}\, \|\log(f/f_0)\|_L^2 + \sqrt n\, W_n(\log(f/f_0)) + R_n(f, f_0). \]
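The likelihood expansion can be checked on a toy example. The sketch below assumes that the remainder is $R_n(f,f_0) = nP_{f_0}\log(f/f_0) + \frac{n}{2}\|\log(f/f_0)\|_L^2$, which is the form that makes the displayed identity exact. As a hypothetical test case it takes $f_0 = 1$ on $[0,1]$ and the exponential family $f_\theta(x) = \theta e^{\theta x}/(e^\theta - 1)$, for which $P_{f_0}\log(f_\theta/f_0) = \theta/2 - \log c(\theta)$ with $c(\theta) = (e^\theta - 1)/\theta$, and $\|\log(f_\theta/f_0)\|_L^2 = \theta^2/12$.

```python
import math
import random

def expansion_terms(data, theta):
    """Return (lhs, rhs, R_n) of the log-likelihood expansion for f_theta against f0 = 1."""
    n = len(data)
    log_c = math.log(math.expm1(theta) / theta)     # log c(theta)
    u = [theta * x - log_c for x in data]           # log(f/f0) at the observations
    P_u = theta / 2.0 - log_c                       # P_{f0} log(f/f0)
    norm_L_sq = theta ** 2 / 12.0                   # ||log(f/f0)||_L^2 = Var(theta * X)
    W_n = sum(v - P_u for v in u) / math.sqrt(n)    # empirical process term
    R_n = n * P_u + n * norm_L_sq / 2.0             # remainder (assumed form)
    lhs = sum(u)                                    # l_n(f) - l_n(f0)
    rhs = -n * norm_L_sq / 2.0 + math.sqrt(n) * W_n + R_n
    return lhs, rhs, R_n

rng = random.Random(0)
n = 1000
data = [rng.random() for _ in range(n)]
lhs, rhs, R_n = expansion_terms(data, theta=1.0 / math.sqrt(n))
```

For a local alternative $\theta = 1/\sqrt n$ the remainder is of order $n\theta^4$, hence negligible, which is the LAN-type behaviour that the notation is designed to capture.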

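Denote, for any density $f$ in $\mathcal{F}$ and any given $u$ in $L^2(f_0)$,

\[ B(u, f, f_0) = \Big\langle \frac{f - f_0}{f_0}, u \Big\rangle_L - \langle \log(f/f_0), u\rangle_L. \]

Let $D_n$ be a measurable set. Denote by $\Pi^{D_n}$ the restriction of $\Pi$ to $D_n$. Suppose, as $n \to \infty$,

\[ E^n_{f_0} \Pi(D_n \mid X^{(n)}) = 1 + o(1). \tag{17} \]

Combining (17) and Markov's inequality leads to, for any $M_n \to \infty$,

\[ E^n_{f_0} \Pi[f : \|f - f_0\|_\infty > M_n \varepsilon^*_{n,\alpha} \mid X^{(n)}] \le (M_n \varepsilon^*_{n,\alpha})^{-1}\, E^n_{f_0}\Big[ E^{\Pi^{D_n}}[\|f - f_0\|_\infty \mid X^{(n)}]\, \Pi(D_n \mid X^{(n)}) \Big] + o(1). \]

In the sequel we focus on bounding $E^{\Pi^{D_n}}[\|f - f_0\|_\infty \mid X^{(n)}]$ from above.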

4.3 Density estimation, log-density priors 

Let us define the set $D_n$ by, for $\varepsilon_n = (\log n)^\nu \varepsilon_{n,\alpha}$ the rate in (14), $L_n$ as in (3) and $\zeta_n = \varepsilon_n 2^{L_n/2}$,

\[ D_n = \{ f,\ \|f - f_0\|_2 \le \varepsilon_n,\ \|f - f_0\|_\infty \le \zeta_n \}. \tag{18} \]

Under (14), Lemma 4 implies (17), up to multiplication of $\varepsilon_n, \zeta_n$ by large enough constants.

4.3.1 First step, reduction to the logarithmic scale 

Let us set $g = \log f$ and $g_0 = \log f_0$. In particular, $g = T - c(T)$. First one notes that obtaining
a rate going to 0 for $\|g - g_0\|_\infty$ implies the same rate up to constants for $\|f - f_0\|_\infty$. Indeed,
$\|f - f_0\|_\infty = \|e^{g_0}(e^{g - g_0} - 1)\|_\infty \lesssim \|g - g_0\|_\infty$ using the bound $|e^x - 1| \lesssim |x|$ for small $x$ and that
$\|f_0\|_\infty$ is bounded. So, instead of writing Markov's inequality as above with $f$, we write it with $g$,
the set $D_n$ still being the one defined in (18) with the dependence on $f - f_0$.

That is, we focus on bounding $E^{\Pi^{D_n}}[\|g - g_0\|_\infty \mid X^{(n)}]$ from above. Now write, with the
notation $g^{L_n}$ denoting the $L^2$-projection up to level $L_n$ as in Section 2.1, and $L_n$ as in (3),

\[
\begin{aligned}
E^{\Pi^{D_n}}[\|g - g_0\|_\infty \mid X^{(n)}] &\le \int \|g^{L_n} - g_0^{L_n}\|_\infty\, d\Pi^{D_n}(f \mid X^{(n)}) + \int \|g^{L_n^c}\|_\infty\, d\Pi^{D_n}(f \mid X^{(n)}) + \|g_0^{L_n^c}\|_\infty \\
&=: (i) + (ii) + (iii).
\end{aligned}
\]

The term (ii) is zero because the sum defining $T$ goes up to level $l \le L_n$ under the prior distribution,
and the constant function 1 is orthogonal to higher levels. Since $g_0 = \log f_0$ belongs to $B^\alpha_{\infty,\infty}$ by
assumption, the term (iii) is bounded by a constant times $\varepsilon^*_{n,\alpha}$.




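We now start analysing the term (i). First, let us introduce, for $\{A_{l,k}\}_{l,k}$ a collection of elements of $L^2(f_0)$ to be chosen later, and $L_n$ as in (3),

$$\mathcal{T}^{L_n}(\cdot) := g_0^{L_n}(\cdot) + \frac{1}{\sqrt{n}}\sum_{l=0}^{L_n}\sum_{k=0}^{2^l-1} W_n(A_{l,k})\,\psi_{lk}(\cdot). \qquad (19)$$

Next let us write

$$(i) \le \int \|g^{L_n} - \mathcal{T}^{L_n}\|_\infty\, d\Pi^{D_n}(f \mid X^{(n)}) + \|\mathcal{T}^{L_n} - g_0^{L_n}\|_\infty .$$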



The second term is bounded with the help of Lemma 7. For the first term, following the scheme of proof of the maximal inequality in Section 2.1 via the moment generating function, one sees that it is enough to bound for $t > 0$ the following quantity, uniformly in $l$, $k$ and $l \le L_n$,

$$M_{lk}(t) := e^{-tW_n(A_{l,k})}\, E^{\Pi^{D_n}}\Big[e^{t\sqrt{n}\langle g-g_0,\psi_{lk}\rangle_2} \mid X^{(n)}\Big]. \qquad (20)$$



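Denote $\rho(x) := \log(1+x) - x$. It holds

$$\langle g-g_0, \psi_{lk}\rangle_2 = \int_0^1 \log\Big(1 + \frac{f-f_0}{f_0}\Big)\psi_{lk} = \int_0^1 \frac{f-f_0}{f_0}\,\psi_{lk} + \int_0^1 \rho\Big(\frac{f-f_0}{f_0}\Big)\psi_{lk}.$$

On $D_n$ we have an intermediate sup-norm rate $\zeta_n = o(1)$ when $\alpha > 1/2$. In this case the argument of $\rho$ in the previous display tends to 0. Using the bound $|\rho(u)| \le u^2$ for small $u$, one gets

$$\Big|\int_0^1 \rho\Big(\frac{f-f_0}{f_0}\Big)\psi_{lk}\Big| \le \|\psi_{lk}\|_\infty \int_0^1 \Big(\frac{f-f_0}{f_0}\Big)^2 \lesssim 2^{l/2}\|f-f_0\|_2^2. \qquad (21)$$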

This bound is a $O(1/\sqrt{n})$ on $D_n$ as soon as $2^{L_n/2}\varepsilon_n^2 = O(1/\sqrt{n})$, which is satisfied if $\alpha > 1$ (quite interestingly, one can show that the present reduction at the logarithmic level is not what limits the rate in the case $1/2 < \alpha < 1$, see the end of the proof). What precedes suggests writing

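$$\sqrt{n}\langle g-g_0, \psi_{lk}\rangle_2 \doteq \sqrt{n}\langle f-f_0, \zeta_{l,k}\rangle_2,$$

where $\doteq$ means equality up to a remainder term (here the one that has just been controlled above) and where

$$\zeta_{l,k} = \frac{\psi_{lk}}{f_0}. \qquad (22)$$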
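The condition $2^{L_n/2}\varepsilon_n^2 = O(1/\sqrt{n})$ used to control the remainder can be checked by exponent bookkeeping. The following is a minimal sketch, assuming $2^{L_n}$ is of order $n^{1/(2\alpha+1)}$ and $\varepsilon_n$ of order $n^{-\alpha/(2\alpha+1)}$ with all logarithmic factors ignored; the function names are illustrative only.

```python
from fractions import Fraction


def remainder_exponent(alpha):
    """Exponent of n in 2^{L_n/2} * eps_n^2, logarithmic factors ignored.

    Assumes 2^{L_n} ~ n^{1/(2*alpha+1)} and eps_n ~ n^{-alpha/(2*alpha+1)}.
    """
    a = Fraction(alpha)
    return (Fraction(1, 2) - 2 * a) / (2 * a + 1)


def remainder_is_root_n_small(alpha):
    """True when the polynomial part of the remainder is at most n^{-1/2}."""
    return remainder_exponent(alpha) <= Fraction(-1, 2)
```

Under these assumptions the exponent equals exactly $-1/2$ at $\alpha = 1$, so at that boundary the ignored logarithmic factors matter; for $\alpha < 1$ the exponent is strictly larger than $-1/2$.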



This means that we can reason as if one were considering the semiparametric problem of estimating the linear functional of the density $f \to \langle \zeta_{l,k}, f\rangle_2$. The corresponding efficient influence function is $\tilde\zeta_{l,k} = \zeta_{l,k} - P_{f_0}\zeta_{l,k}$, with respect to the tangent set $\mathcal{H}_{f_0} := \{h : [0,1] \to \mathbb{R},\ h \text{ bounded},\ \int_0^1 h f_0 = 0\}$, see [22], Chap. 25 for definitions.

There is one difficulty with $\tilde\zeta_{l,k}$. Since it is not an element of the basis of expansion of the prior $\Pi$, it needs to be properly approximated by the prior in some sense. In fact, there is a fundamental difference with what has been done so far in proving BvM-type results, see e.g. [2], [14], [5]. Here we need to study approximating sequences uniformly in the indexes $l$ and $k$, and a sharp control on this dependence is essential, see the key Lemma 2, where two regimes of indexes arise, depending on whether $l$ is small or close to $L_n$.

So, instead of working with $\tilde\zeta_{l,k}$ directly, one replaces it by an approximation $\tilde A_{l,k}$, leading to the scheme

$$\sqrt{n}\langle g-g_0, \psi_{lk}\rangle_2 \doteq \sqrt{n}\langle f-f_0, \tilde\zeta_{l,k}\rangle_2 \doteq \sqrt{n}\langle f-f_0, \tilde A_{l,k}\rangle_2 .$$

This of course induces a bias term for any $l$, $k$, familiar in the context of semiparametric BvM results, see e.g. [2], [3], [5], equal to

$$\sqrt{n}\langle f-f_0, \tilde\zeta_{l,k} - \tilde A_{l,k}\rangle_2 = \sqrt{n}\int_0^1 (f-f_0)(\tilde\zeta_{l,k} - \tilde A_{l,k}). \qquad (23)$$

This term is controlled using Lemma 2 below. Indeed, on $D_n$ the bounds on (23) given by Lemma 2 are at most $\sqrt{n}\,h_n\varepsilon_n = o(1)$ if $\alpha > 1$. Next apply Lemma 3 with $\gamma_n = A_{l,k}$. The estimates of the $L^2$ and sup-norm of $A_{l,k}$ imply that the conditions of application of Lemma 3 are satisfied. Thus

where we have set $f_t = e^{g_t}$ with $g_t = g_{t,l,k}$ defined as in Lemma 3 by (the expression is invariant under adding a constant to $g$, so one can write it either with $A_{l,k}$ or $\tilde A_{l,k} = A_{l,k} - P_{f_0}A_{l,k}$)

$$g_t = g - \frac{t}{\sqrt{n}}A_{l,k} - \log\int_0^1 e^{g - \frac{t}{\sqrt{n}}A_{l,k}}. \qquad (25)$$

In the case $1/2 < \alpha < 1$, the cost of replacing $f$ by $\log f$ is controlled by $\|f_0(g-g_0) - (f-f_0)\|_\infty = \|f_0\,\rho(f/f_0 - 1)\|_\infty \lesssim \|f/f_0 - 1\|_\infty^2$, which is bounded on $D_n$ by a constant times $\zeta_n^2$ via Lemma 4. Using Remark 1 below, the bias (23) leads to an extra term $\exp\{t\sqrt{n}\, n^{(-3\alpha/2)/(1+2\alpha)}\}$ in (24).

4.3.2 'Uniform' approximations of efficient influence functions $\tilde\zeta_{l,k}$

For any $l \le L_n$ and $k$ between 0 and $2^l - 1$, define $A_{l,k}$ to be the $L^2$-projection of $\zeta_{l,k}$ on the space spanned by the first $L_n$ levels of wavelet coefficients,

$$A_{l,k} = \sum_{\lambda \le L_n}\ \sum_{0 \le \mu \le 2^\lambda - 1} \langle \zeta_{l,k}, \psi_{\lambda\mu}\rangle_2\, \psi_{\lambda\mu}. \qquad (26)$$

For any $l$, $k$ in the previous ranges, we also set

$$\tilde A_{l,k} = A_{l,k} - P_{f_0}A_{l,k}.$$

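Lemma 2. Let $f_0$ belong to $C^\alpha[0,1]$, with $\alpha > 1$. For any $l$ such that $1 \le 2^l \le 2^{L_n}$ and $0 \le k \le 2^l - 1$, and any density $f$ in $\mathcal{F}$,

$$\|A_{l,k} - \zeta_{l,k}\|_\infty \lesssim 2^{l(\frac12+\alpha)}\,2^{-\alpha L_n},$$

$$\Big|\int_0^1 (A_{l,k} - \zeta_{l,k})(f-f_0)\Big| \lesssim \big(2^{(l-L_n)\alpha} \wedge 2^{-l}\big)\,\|f-f_0\|_2 .$$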


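Proof. For any admissible indexes $\lambda$, $\mu$, let $S_{\lambda\mu}$ denote the support of the wavelet $\psi_{\lambda\mu}$ in $[0,1]$ and $|S_{\lambda\mu}|$ its Lebesgue measure. The following identity holds both in $L^2[0,1]$ (definition of the $L^2$-projection) and in $L^\infty[0,1]$ (because $\zeta_{l,k} \in B^{\delta}_{\infty,\infty}$ with $\delta > 0$)

$$\zeta_{l,k} - A_{l,k} = \sum_{\lambda > L_n}\sum_{\mu=0}^{2^\lambda-1} \langle \zeta_{l,k}, \psi_{\lambda\mu}\rangle_2\, \psi_{\lambda\mu}. \qquad (27)$$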

Since $\psi_{lk}$ belongs to $B^{\alpha}_{\infty,\infty}$ and $f_0$ to $C^\alpha$, Lemma 5 implies that $\zeta_{l,k} = \psi_{lk}\cdot f_0^{-1}$ belongs to $B^{\alpha}_{\infty,\infty}$, with $\|\cdot\|_{\infty,\infty,\alpha}$-norm bounded above by a constant times $\|\psi_{lk}\|_{\infty,\infty,\alpha}\|f_0^{-1}\|_{\infty,\infty,\alpha} \lesssim 2^{l(\frac12+\alpha)}$, again by Lemma 5, using $f_0^{-1} \in C^\alpha \subset B^{\alpha}_{\infty,\infty}$. Now using the localisation property of the wavelet basis,

$$\|A_{l,k} - \zeta_{l,k}\|_\infty \le \Big\|\sum_{\lambda>L_n}\sum_{\mu=0}^{2^\lambda-1} |\langle \zeta_{l,k}, \psi_{\lambda\mu}\rangle_2\, \psi_{\lambda\mu}|\Big\|_\infty$$

$$\lesssim \sum_{\lambda>L_n} 2^{\lambda/2}\, 2^{-\lambda(\alpha+1/2)} \max_{0\le\mu\le 2^\lambda-1}\Big[2^{\lambda(\alpha+1/2)}\,|\langle \zeta_{l,k}, \psi_{\lambda\mu}\rangle_2|\Big]$$

$$\lesssim \|\zeta_{l,k}\|_{\infty,\infty,\alpha}\sum_{\lambda>L_n} 2^{-\lambda\alpha} \lesssim 2^{l(\frac12+\alpha)}\,2^{-\alpha L_n}.$$

Now let us prove that the support of $\zeta_{l,k} - A_{l,k}$ has diameter at most a constant times $|S_{lk}|$. Indeed, $\zeta_{l,k} - A_{l,k}$ written above is a linear combination of 'high'-frequency wavelets ($\lambda > L_n$), with support diameter thus at most of the order of $|S_{\lambda\mu}| \le R|S_{lk}|$, for a fixed constant $R$, for any $\lambda > L_n$, any admissible $\mu$, since $\lambda > l$. But in the sum (27), one may keep only those $\psi_{\lambda\mu}$ whose support intersects the one of $\zeta_{l,k}$, otherwise the coefficient $\langle \zeta_{l,k}, \psi_{\lambda\mu}\rangle_2$ is 0. In particular, all supports of the $\psi_{\lambda\mu}$'s which have a nonzero contribution to (27) are contained in an interval of $[0,1]$ of diameter at most $(2R+1)|S_{lk}|$. Thus the support of $\zeta_{l,k} - A_{l,k}$ itself has diameter bounded by $(2R+1)|S_{lk}|$.

Now we turn to the bound on $\int_0^1 (A_{l,k} - \zeta_{l,k})(f-f_0)$. Let $\Delta_{lk}$ denote the support of $A_{l,k} - \zeta_{l,k}$. The last integral can also be written $\int_0^1 (A_{l,k} - \zeta_{l,k})(f-f_0)1_{\Delta_{lk}}$. Conclude, bounding $A_{l,k} - \zeta_{l,k}$ by its supremum and next applying the Cauchy-Schwarz inequality,

$$\Big|\int_0^1 (A_{l,k} - \zeta_{l,k})(f-f_0)\Big| \le \|A_{l,k} - \zeta_{l,k}\|_\infty \sqrt{|\Delta_{lk}|}\,\|f-f_0\|_2 \lesssim 2^{(l-L_n)\alpha}\|f-f_0\|_2 .$$



To obtain the other part of the bound, the idea is to use a different approximating sequence $\mathcal{D}_{l,k}$ for which the comparison to $\zeta_{l,k}$ is easier for large $l$'s. Define $\mathcal{D}_{l,k}$ to be the function obtained by replacing $f_0$ in (22) by its average on the support of $\psi_{lk}$,

$$\mathcal{D}_{l,k} = \frac{\psi_{lk}}{[f_0]_{lk}}, \qquad (28)$$

where we have set

$$[f_0]_{lk} = \frac{1}{|S_{lk}|}\int_{S_{lk}} f_0 .$$



Note that since $l \le L_n$, by definition $\mathcal{D}_{l,k}$ belongs to the vector space generated by the first $L_n$ levels of wavelet coefficients. In particular it holds $\|A_{l,k} - \zeta_{l,k}\|_2 \le \|\mathcal{D}_{l,k} - \zeta_{l,k}\|_2$ by definition of the $L^2$-projection. Since by definition again $\mathcal{D}_{l,k} - \zeta_{l,k}$ has support included in $S_{lk}$, one gets

$$\|A_{l,k} - \zeta_{l,k}\|_2^2 \le \int_0^1 1_{S_{lk}}(\mathcal{D}_{l,k} - \zeta_{l,k})^2 \le \|\mathcal{D}_{l,k} - \zeta_{l,k}\|_\infty^2\, |S_{lk}| .$$

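Next one bounds the last sup-norm. Denoting $\rho_0 := \inf_{x\in[0,1]} f_0(x)$,

$$\|\mathcal{D}_{l,k} - \zeta_{l,k}\|_\infty \le \rho_0^{-2}\,\|\psi_{lk}\|_\infty \sup_{x\in S_{lk}} |f_0(x) - [f_0]_{lk}|$$

$$\le \rho_0^{-2}\,\|\psi_{lk}\|_\infty \sup_{x\in S_{lk}} |S_{lk}|^{-1}\Big|\int_{S_{lk}} (f_0(x) - f_0(u))\,du\Big|$$

$$\lesssim \rho_0^{-2}\,\|\psi_{lk}\|_\infty \sup_{x\in S_{lk}} |S_{lk}|^{-1}\int_{S_{lk}} |x-u|\,du$$

$$\le \rho_0^{-2}\,\|\psi_{lk}\|_\infty\, |S_{lk}|^{-1}\int_0^{|S_{lk}|} u\,du$$

$$\le \rho_0^{-2}\,\|\psi_{lk}\|_\infty\, |S_{lk}|^{-1}\,|S_{lk}|^2/2 \lesssim 2^{-l/2},$$

where for the third inequality we have used the fact that $f_0$ is at least Hölder 1 and for the last inequality that $|S_{lk}|$ is of the order $2^{-l}$ and $\|\psi_{lk}\|_\infty$ of the order $2^{l/2}$ up to constants. Thus

$$\Big|\int_0^1 (A_{l,k} - \zeta_{l,k})(f-f_0)\Big| \le \|A_{l,k} - \zeta_{l,k}\|_2\,\|f-f_0\|_2 \le \sqrt{|S_{lk}|}\,\|\mathcal{D}_{l,k} - \zeta_{l,k}\|_\infty\,\|f-f_0\|_2 \lesssim 2^{-l}\|f-f_0\|_2 .$$

□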
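The two regimes of indexes in Lemma 2 can be explored numerically. The sketch below is illustrative only: it looks for the level $l$ at which the minimum of $2^{(l-L)\alpha}$ and $2^{-l \cdot \text{decay}}$ is largest, where `decay = 1` corresponds to the bound $2^{-l}$ of the lemma and other values are a hypothetical generalisation (for instance `decay = alpha` gives the pair of bounds appearing for less smooth $f_0$).

```python
def worst_level(L, alpha, decay=1.0):
    """Return the level l in 0..L maximising min(2^{-l*decay}, 2^{(l-L)*alpha}),
    together with the value of that minimum.

    L, alpha and decay are illustrative inputs; constants are ignored.
    """
    def bound(l):
        return min(2.0 ** (-l * decay), 2.0 ** ((l - L) * alpha))

    best = max(range(L + 1), key=bound)
    return best, bound(best)
```

For example, with `L = 12` and `alpha = 2` the two bounds cross at $l = 8$, that is $l = L\alpha/(\alpha+1)$: small levels are governed by $2^{(l-L)\alpha}$ and levels close to $L$ by $2^{-l}$.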



Remark 1. In the case $1/2 < \alpha < 1$, similarly one gets the bound $(2^{-l\alpha} \wedge 2^{(l-L_n)\alpha})\|f-f_0\|_2$. The worst case of the minimum of the two bounds is attained for $2^l = 2^{L_n/2}$. On $D_n$ this leads to a bound for the integral $\int_0^1 (A_{l,k} - \zeta_{l,k})(f-f_0)$ of order $n^{(-3\alpha/2)/(1+2\alpha)}$, up to logarithmic factors, for all considered indexes $l$, $k$.
4.3.3 Change of variables

Now everything is in place to start exploiting (24)-(25). First, rewrite (24) as 

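$$\frac{\int e^{\ell_n(f_t)-\ell_n(f_0)}\,d\Pi^{D_n}(f)}{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi^{D_n}(f)} = \Pi(D_n \mid X^{(n)})^{-1}\,\frac{\int 1_{D_n}(f)\,e^{\ell_n(f_t)-\ell_n(f_0)}\,d\Pi(f)}{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi(f)}. \qquad (29)$$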

The expression $\log f_t$ in (25), due to its invariance by adding a constant and recalling that $g = T - c(T)$ from (11), can be seen as a function of $T - tA_{l,k}/\sqrt{n}$ (the constant $c(T)$ vanishes). More precisely, we are now ready to change variables in the prior by setting

$$\tilde T = T - \frac{t}{\sqrt{n}}A_{l,k}. \qquad (30)$$

Essentially, if the 'complexity' (in a sense to be specified, this can be read as 'size') of $A_{l,k}$ is not too large in view of the chosen prior $\Pi$, the fact of having $f_t$ instead of $f$ in (24) will not matter much and the corresponding ratio of integrals will be close to 1. We treat the case of heavy tailed priors on coefficients first. In fact, as can intuitively be guessed, a prior with heavy tails is less influenced under shift transformations than a more concentrated prior.

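Denote by $\mathcal{C}_n = \{(\lambda,\mu) \in \mathbb{N}^2,\ 0 \le \mu \le 2^\lambda - 1,\ 1 \le 2^\lambda \le 2^{L_n}\}$. By the definition (26) of $A_{l,k}$,

$$\langle A_{l,k}, \psi_{\lambda\mu}\rangle_2 = \langle \zeta_{l,k}, \psi_{\lambda\mu}\rangle_2, \qquad (\lambda,\mu) \in \mathcal{C}_n,\ (l,k) \in \mathcal{C}_n .$$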

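Heavy-tailed prior. With the chosen prior on $f$, the numerator in (29) is in fact an integral over the law of the coefficients of $T$ in (10), that is, over (a subset of) $\mathbb{R}^{2^{L_n}}$. The change of variables (30) is thus a shift in $\mathbb{R}^{2^{L_n}}$, and its Jacobian is 1. The coordinates of $T$ in the wavelet basis $\{\psi_{\lambda\mu}\}$ have densities $\sigma_\lambda^{-1}\varphi(\theta_{\lambda\mu}/\sigma_\lambda)$ with respect to $d\theta_{\lambda\mu}$ (we denote by $\theta_{\lambda\mu}$ the integrating variable). The transformation in density can be controlled by, since $\varphi = \varphi_H$ has a Lipschitz logarithm,

$$\Big|\log \prod_{(\lambda,\mu)\in\mathcal{C}_n} \frac{\varphi(\theta_{\lambda\mu}/\sigma_\lambda)}{\varphi\big(\{\theta_{\lambda\mu} - \frac{t}{\sqrt{n}}\langle A_{l,k},\psi_{\lambda\mu}\rangle_2\}/\sigma_\lambda\big)}\Big|$$

$$= \Big|\sum_{(\lambda,\mu)\in\mathcal{C}_n} (\log\varphi)(\theta_{\lambda\mu}/\sigma_\lambda) - (\log\varphi)\big(\{\theta_{\lambda\mu} - \tfrac{t}{\sqrt{n}}\langle A_{l,k},\psi_{\lambda\mu}\rangle_2\}/\sigma_\lambda\big)\Big|$$

$$\lesssim \frac{|t|}{\sqrt{n}}\sum_{(\lambda,\mu)\in\mathcal{C}_n} \sigma_\lambda^{-1}\,|\langle \zeta_{l,k},\psi_{\lambda\mu}\rangle_2|, \qquad \forall\,(l,k)\in\mathcal{C}_n .$$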



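We now study conditions on the $\sigma_\lambda$'s under which the last display is bounded above by $C|t|$. Let us split the sum over $\mathcal{C}_n$ in the two cases $\lambda \le l$ and $\lambda > l$. When $\lambda \le l$, for any fixed level $\lambda$ there is a constant number of wavelets $\psi_{\lambda\mu}$ intersecting the support of $\zeta_{l,k}$. Combined with $|\langle A_{l,k},\psi_{\lambda\mu}\rangle_2| \lesssim 1$, this leads to the condition $\sum_{\lambda\le l}\sigma_\lambda^{-1} \lesssim \sqrt{n}$. When $\lambda > l$, for any fixed level $\lambda$ there is a constant times $2^{\lambda-l}$ wavelets $\psi_{\lambda\mu}$ intersecting the support of $\zeta_{l,k}$, leading to

$$\sum_{(\lambda,\mu)\in\mathcal{C}_n,\ \lambda>l} \sigma_\lambda^{-1}\,|\langle \zeta_{l,k},\psi_{\lambda\mu}\rangle_2| \lesssim \sum_{l<\lambda\le L_n} \sigma_\lambda^{-1}\, 2^{\lambda-l}\, 2^{-\lambda(\frac12+\alpha)}\,\|\zeta_{l,k}\|_{\infty,\infty,\alpha}$$

$$\lesssim \sum_{l<\lambda\le L_n} \sigma_\lambda^{-1}\, 2^{(l-\lambda)(\alpha-\frac12)},$$

where we have used that $\zeta_{l,k} = \psi_{lk}/f_0$ is $\alpha$-smooth in a similar way as for Lemma 2. These conditions are quite mild. In particular for $\alpha > 1/2$ they are implied by $\sum_{\lambda\le L_n}\sigma_\lambda^{-1} \lesssim \sqrt{n}$.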
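The role of the Lipschitz property of $\log\varphi$ in the heavy-tailed case can be illustrated with a small sketch. The Laplace density used here is only a stand-in chosen because its logarithm is 1-Lipschitz; it is an assumption of this example and not necessarily the heavy-tailed density of the paper. All names are illustrative.

```python
import math


def log_laplace(x):
    """Log-density of the standard Laplace law (its logarithm is 1-Lipschitz)."""
    return -abs(x) - math.log(2.0)


def shift_cost(thetas, shifts, sigmas):
    """Absolute log-ratio of product densities before and after shifting
    coordinate theta_j (scale sigma_j) by shifts_j."""
    total = sum(
        log_laplace(th / s) - log_laplace((th - a) / s)
        for th, a, s in zip(thetas, shifts, sigmas)
    )
    return abs(total)


def lipschitz_bound(shifts, sigmas):
    """Upper bound sum_j |shift_j| / sigma_j implied by the 1-Lipschitz property."""
    return sum(abs(a) / s for a, s in zip(shifts, sigmas))
```

The bound does not depend on the integrating variables `thetas`, which is what makes the shift harmless for such priors once the sum of the scaled shifts is controlled.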
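Gaussian prior. Let us write explicitly the log-ratio of densities

$$\sum_{(\lambda,\mu)\in\mathcal{C}_n} (\log\varphi)(\theta_{\lambda\mu}/\sigma_\lambda) - (\log\varphi)\big(\{\theta_{\lambda\mu} - \tfrac{t}{\sqrt{n}}\langle A_{l,k},\psi_{\lambda\mu}\rangle_2\}/\sigma_\lambda\big)$$

$$= \sum_{(\lambda,\mu)\in\mathcal{C}_n} \Big[\frac{t^2}{2n\sigma_\lambda^2}\langle A_{l,k},\psi_{\lambda\mu}\rangle_2^2 - \frac{t}{\sqrt{n}\,\sigma_\lambda^2}\,\theta_{\lambda\mu}\langle A_{l,k},\psi_{\lambda\mu}\rangle_2\Big]. \qquad (31)$$

The obtained quantity still depends on the integrating variables $\theta_{\lambda\mu}$. The idea is to exploit the fact that on $D_n$, it holds $\|g-g_0\|_2 \lesssim \varepsilon_n$, which is obtained along the way in the proof of Lemma 4. But $g = T - c(T) = T - c(T)1$, where 1 denotes the constant function equal to 1. Since 1 is orthogonal to high levels, it means that for any large enough $\lambda$, say $\lambda \ge K$, and any $\mu$, it holds $|\theta_{\lambda\mu} - g_{0,\lambda\mu}| \le \|g-g_0\|_2 \lesssim \varepsilon_n$. So, for such $\lambda$, $\mu$, we decompose $\theta_{\lambda\mu} = \theta_{\lambda\mu} - g_{0,\lambda\mu} + g_{0,\lambda\mu}$. For coefficients $\theta_{\lambda\mu}$ such that $\lambda < K$, we use a different argument.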
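The Gaussian log-ratio in (31) is, coordinate by coordinate, an elementary identity that can be checked numerically. In the sketch below `a` stands for the shift $t\langle A_{l,k},\psi_{\lambda\mu}\rangle_2/\sqrt{n}$ of one coordinate; the function names are illustrative.

```python
import math


def log_std_normal(x):
    """Log-density of the standard normal law."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def gaussian_log_ratio(theta, a, sigma):
    """log phi(theta / sigma) - log phi((theta - a) / sigma) for one coordinate."""
    return log_std_normal(theta / sigma) - log_std_normal((theta - a) / sigma)


def closed_form(theta, a, sigma):
    """Closed form a^2 / (2 sigma^2) - a * theta / sigma^2 of the same quantity."""
    return a * a / (2.0 * sigma ** 2) - a * theta / sigma ** 2
```

Unlike in the Lipschitz case, the second term is linear in the integrating variable `theta`, which is why the argument has to control the size of the coefficients on $D_n$.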

Let us first deal with the term containing $\theta_{\lambda\mu}$ in (31) when $\lambda < K$. From the beginning of the proof, using Lemma 9, one can restrict slightly the set $D_n$ by intersecting it with $\{T : \|T\|_\infty \le C\sqrt{n}\,\varepsilon_n\}$. This implies that $\max_{\lambda<K,\,\mu} |\theta_{\lambda\mu}| \lesssim \sqrt{n}\,\varepsilon_n$. Hence, using that $|\langle A_{l,k},\psi_{\lambda\mu}\rangle_2| \lesssim 1$ and the assumed specific form of $\sigma_\lambda$, one gets that the term at stake is at most a fixed constant times $\varepsilon_n$.

Now we bound (31) and only have to deal with $\lambda \ge K$ for the part depending on $\theta_{\lambda\mu}$.

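For any $(l,k) \in \mathcal{C}_n$, the first term in (31) is

$$\sum_{(\lambda,\mu)\in\mathcal{C}_n} \frac{t^2}{2n\sigma_\lambda^2}\langle A_{l,k},\psi_{\lambda\mu}\rangle_2^2 \lesssim \frac{t^2}{n}\Big[\sum_{\lambda\le l}\sigma_\lambda^{-2} + \sum_{l<\lambda\le L_n}\sigma_\lambda^{-2}\,2^{(l-\lambda)2\alpha}\Big],$$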

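where the bound is obtained in a similar way as in the heavy-tailed case by distinguishing the cases $\lambda \le l$ and $\lambda > l$. For the second term we decompose $\theta = \theta - g_0 + g_0$ and use the Cauchy-Schwarz inequality on the $\theta - g_0$ part to obtain

$$\frac{|t|}{\sqrt{n}}\,\|g-g_0\|_2\Big\{\sum_{(\lambda,\mu)\in\mathcal{C}_n}\sigma_\lambda^{-4}\langle A_{l,k},\psi_{\lambda\mu}\rangle_2^2\Big\}^{1/2} \lesssim \frac{|t|\,\varepsilon_n}{\sqrt{n}}\Big[\sum_{\lambda\le l}\sigma_\lambda^{-4} + \sum_{l<\lambda\le L_n}\sigma_\lambda^{-4}\,2^{(l-\lambda)2\alpha}\Big]^{1/2}.$$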



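Finally the remaining term with $g_0$ is bounded by

$$\frac{|t|}{\sqrt{n}}\sum_{(\lambda,\mu)\in\mathcal{C}_n}\sigma_\lambda^{-2}\,|g_{0,\lambda\mu}|\,|\langle A_{l,k},\psi_{\lambda\mu}\rangle_2| \lesssim \frac{|t|}{\sqrt{n}}\Big(\sum_{\lambda\le l}\sigma_\lambda^{-2}\,2^{-\lambda(\frac12+\alpha)} + \sum_{l<\lambda\le L_n}\sigma_\lambda^{-2}\,2^{(l-\lambda)(\alpha-\frac12)}\,2^{-\lambda(\frac12+\alpha)}\Big).$$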



If one asks for the mild condition o\ > 2 A (2+ Q ) for any < A < L n , all bounds obtained 
above in the Gaussian case are at most a constant times $(|t| + t^2)$.



4.3.4 End of proof 

To conclude for both considered classes of priors, note that the indicator $1_{D_n}(f)$ in (29) becomes $1_{D'_n}(f_t)$ under the change of variables (for a set $D'_n$ that one can write explicitly, although this will not be needed here) and one simply further bounds the indicator $1_{D'_n}$ by 1 on the numerator. Once the change of variable is done, the assumed conditions on $\{\sigma_l\}$ ensure that the ratio of densities is bounded by $e^{C(|t|+t^2)}$ for some constant $C$ and one gets

$$\frac{\int e^{\ell_n(f_t)-\ell_n(f_0)}\,1_{D_n}(f)\,d\Pi(f)}{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi(f)} \le e^{C(|t|+t^2)}\,\frac{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi(f)}{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi(f)} = e^{C(|t|+t^2)}.$$

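In particular, inequality (24) can be further written, again for a fixed constant $C$,

$$e^{-tW_n(A_{l,k})}\,E^{\Pi^{D_n}}\Big[e^{t\sqrt{n}\langle g-g_0,\psi_{lk}\rangle_2} \mid X^{(n)}\Big] = E^{\Pi^{D_n}}\Big[e^{t\sqrt{n}\langle g-\mathcal{T}^{L_n},\psi_{lk}\rangle_2} \mid X^{(n)}\Big] \le e^{C(|t|+t^2)}\,\Pi(D_n \mid X^{(n)})^{-1}.$$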



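Write, for any $s > 0$, similar to the maximal inequality used in the Gaussian white noise case,

$$\int \|g^{L_n} - \mathcal{T}^{L_n}\|_\infty\, d\Pi^{D_n}(f \mid X^{(n)}) \le \sum_{l\le L_n}\frac{2^{l/2}}{\sqrt{n}}\,E^{\Pi^{D_n}}\Big[\max_{0\le k<2^l}\big|\sqrt{n}\langle g-\mathcal{T}^{L_n},\psi_{lk}\rangle_2\big| \mid X^{(n)}\Big]$$

$$\le \sum_{l\le L_n}\frac{2^{l/2}}{s\sqrt{n}}\,\log\sum_{k=0}^{2^l-1} E^{\Pi^{D_n}}\Big[e^{s\sqrt{n}\langle g-\mathcal{T}^{L_n},\psi_{lk}\rangle_2} + e^{-s\sqrt{n}\langle g-\mathcal{T}^{L_n},\psi_{lk}\rangle_2} \mid X^{(n)}\Big]$$

$$\le \sum_{l\le L_n}\frac{2^{l/2}}{s\sqrt{n}}\,\log\big[2^{l+1}e^{C(s+s^2)}\big] + \sum_{l\le L_n}\frac{2^{l/2}}{s\sqrt{n}}\,\log\frac{1}{\Pi(D_n \mid X^{(n)})}.$$

Set $s = \sqrt{l}$. The first term in the last display is bounded by a constant times $\frac{1}{\sqrt{n}}\sum_{l\le L_n}\sqrt{l}\,2^{l/2} \lesssim \varepsilon^*_{n,\alpha}$. Now coming back to the application of Markov's inequality one gets

$$E^n_{f_0}\Pi\big[f : \|f-f_0\|_\infty > M_n\varepsilon^*_{n,\alpha} \mid X^{(n)}\big] \lesssim M_n^{-1} + M_n^{-1}\,E^n_{f_0}\Big[\Pi(D_n \mid X^{(n)})\,\log\frac{1}{\Pi(D_n \mid X^{(n)})}\Big] + o(1).$$

With $M_n \to \infty$, the fact that $u \to u\log u^{-1}$ is bounded on $[0,1]$ concludes the proof. □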
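Two elementary facts close this argument: the moment generating function bound on the expected maximum, and the boundedness of $u \mapsto u\log(1/u)$ on $[0,1]$. The sketch below illustrates both on simulated data; the Gaussian draws are a hypothetical stand-in for the posterior coordinates, and by Jensen's inequality the first bound holds exactly for any empirical sample.

```python
import math
import random


def mgf_maximal_bound(samples, s):
    """Return (mean of max_k |Z_k|, (1/s) log sum_k mean(e^{s Z_k} + e^{-s Z_k})).

    samples is a list of draws, each draw being a list of K real coordinates.
    """
    n = len(samples)
    num_coords = len(samples[0])
    mean_max = sum(max(abs(z) for z in row) for row in samples) / n
    mgf_sum = sum(
        sum(math.exp(s * row[k]) + math.exp(-s * row[k]) for row in samples) / n
        for k in range(num_coords)
    )
    return mean_max, math.log(mgf_sum) / s


def u_log_inv_u(u):
    """u * log(1/u), extended by continuity to 0 at u = 0."""
    return 0.0 if u == 0.0 else u * math.log(1.0 / u)
```

The free parameter `s` plays the role of $s$ in the display above and can be tuned level by level; the maximum of `u_log_inv_u` on $[0,1]$ is $1/e$.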

In the case $1/2 < \alpha < 1$, the only difference is that one gets two extra terms: one from going at the logarithmic level, which eventually leads to a rate $\zeta_n^2$; another one from the semiparametric bias in (23), which leads to a rate $\rho_n = n^{(\frac12 - \frac{3\alpha}{2})/(1+2\alpha)}$. This leads to a sup-norm rate of $\zeta_n^2 \vee \rho_n$. Once this rate has been obtained, one can restart the proof once again, but this time knowing that one can use a better intermediate rate of $\zeta_n^2$ in sup-norm. One can then write

$$\Big|\int_0^1 \rho\Big(\frac{f-f_0}{f_0}\Big)\psi_{lk}\Big| \le \Big\|\rho\Big(\frac{f-f_0}{f_0}\Big)\Big\|_\infty\,\|\psi_{lk}\|_1 \lesssim \|f-f_0\|_\infty^2\,2^{-l/2} \lesssim (\zeta_n^2)^2\,2^{-l/2}.$$

This eventually leads to the accelerated rate $(\zeta_n^2)^2 \vee \rho_n$. Iterating this procedure leads to $(\zeta_n^2)^p \vee \rho_n$, any $p \ge 1$ and, for any given $\alpha > 1/2$, to $\rho_n$ as final rate, up to logarithmic terms.

4.4 Density estimation, dyadic histogram priors 

We follow the scheme of proof used for uniform priors in white noise, this time in density estimation, 
with the wavelet basis of expansion being the Haar system. The very specific properties of the 
Haar basis, particularly its close links to approximation by dyadic histograms, enable a simplified 
argument. In particular, as we demonstrate below, the semiparametric bias is always negligible, 
provided the parameters of the prior are reasonably chosen. 
Let us set

$$\hat f_{lk} = \frac{1}{n}\sum_{i=1}^n \psi^H_{lk}(X_i).$$

Set $D_n = \{f,\ h(f,f_0) \le \varepsilon_n\}$, where here $\varepsilon_n = \varepsilon^*_{n,\alpha}$ up to multiplication by a large enough constant. Lemma 10 implies that $E^n_{f_0}\Pi[D_n \mid X^{(n)}] \to 1$. Let $h_n$, $L_n$ be defined as in (3), and denote by $f^{L_n}$ the projection of $f$ onto the subspace $V_{L_n} := \mathrm{Vect}\{\varphi^H, \psi^H_{lk},\ l < L_n,\ 0 \le k < 2^l\}$ and $f^{L_n^c}$ the projection of $f$ onto $\mathrm{Vect}\{\psi^H_{lk},\ l \ge L_n,\ 0 \le k < 2^l\}$ (that is, for simplicity we keep the notation $f^{L_n}$ from Section 2.1, although the basis of projection is now the Haar basis and $l < L_n$ replaces $l \le L_n$). If $\hat f_n^{L_n}$ denotes the element of $V_{L_n}$ of coordinates $\{\hat f_{lk}\}$ in the basis $\{\psi^H_{lk}\}$,

$$E^{\Pi^{D_n}}\big[\|f-f_0\|_\infty \mid X^{(n)}\big] \le \int \|f^{L_n} - \hat f_n^{L_n}\|_\infty\, d\Pi^{D_n}(f \mid X^{(n)}) + \int \|f^{L_n^c}\|_\infty\, d\Pi^{D_n}(f \mid X^{(n)})$$

$$+\ \|\hat f_n^{L_n} - f_0^{L_n}\|_\infty + \|f_0^{L_n^c}\|_\infty =: (i) + (ii) + (iii) + (iv).$$
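The term (iv) is pure bias while (iii) can be written

$$\Big\|\frac{1}{\sqrt{n}}\sum_{l=0}^{L_n-1}\sum_{k=0}^{2^l-1} W_n(\psi^H_{lk})\,\psi^H_{lk}\Big\|_\infty .$$

This term is bounded in expectation by $\varepsilon^*_{n,\alpha}$, exactly as in Lemma 7.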

Next, the high-frequency bias term (ii) is zero. Indeed, for any draw $f$ from the prior, in the inner-product $\langle f, \psi^H_{lk}\rangle_2$ the first element is a dyadic histogram at resolution level $L_n$, so is constant over the support of the Haar basis element $\psi^H_{lk}$ if $l \ge L_n$. Hence the previous inner-product is zero $\Pi$-almost surely, and thus also $\Pi[\cdot \mid X^{(n)}]$-almost surely.
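The vanishing of the high-frequency Haar coefficients of a dyadic histogram can be verified exactly on a fine dyadic grid. The sketch below is an illustration under simplifying assumptions: the Haar wavelet is used without its normalising factor $2^{l/2}$ (which does not affect whether a coefficient is zero), and all function names are hypothetical.

```python
from fractions import Fraction


def haar_sign(l, k, j, J):
    """Sign of the Haar wavelet psi_{lk} on the j-th cell of the grid of mesh 2^{-J}."""
    width = 1 << (J - l)          # number of grid cells in the support of psi_{lk}
    pos = j - k * width
    if 0 <= pos < width // 2:
        return 1
    if width // 2 <= pos < width:
        return -1
    return 0


def haar_coefficient_of_histogram(weights, L, l, k, J):
    """Exact integral of (unnormalised psi_{lk}) times the histogram density
    with bin weights `weights` at resolution level L, computed on a grid of mesh 2^{-J}."""
    assert len(weights) == 1 << L and J > l and J >= L
    total = Fraction(0)
    for j in range(1 << J):
        height = (1 << L) * weights[j >> (J - L)]   # density value on cell j
        total += haar_sign(l, k, j, J) * height * Fraction(1, 1 << J)
    return total
```

With four bins (`L = 2`), every coefficient at levels $l \ge 2$ is exactly zero, while coefficients at coarser levels are differences of bin weights.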

Second, one studies $\langle f^{L_n} - \hat f_n^{L_n}, \psi^H_{lk}\rangle_2$ in the BvM-regime $l < L_n$. Following the maximal inequality approach from Section 2.1, it is enough to bound the posterior expectation of $\exp(t\sqrt{n}\langle f^{L_n} - \hat f_n^{L_n}, \psi^H_{lk}\rangle_2)$, for any possible $k$ and $l < L_n$ and say $|t| \le \log n$ (also, to simplify the notation below we omit mentioning the scaling function $\varphi^H$, but the same Laplace transform control is obtained for it in a similar way). To do so, apply Lemma 3 with $\gamma_n = \psi^H_{lk}$, for any given $l$, $k$ with $l < L_n$. The conditions of the Lemma are satisfied with $a_n = \varepsilon_n$, since $\psi^H_{lk}$ is bounded in $L^2[0,1]$ and has a sup-norm bounded by a constant times $2^{l/2} \le 2^{L_n/2} = o(1/\varepsilon_n)$ if $\alpha > 1/2$. Noting that, again for $l$, $k$ with $l < L_n$,

$$\langle f^{L_n} - \hat f_n^{L_n}, \psi^H_{lk}\rangle_2 = \langle f^{L_n} - f_0^{L_n}, \psi^H_{lk}\rangle_2 + \langle f_0^{L_n} - \hat f_n^{L_n}, \psi^H_{lk}\rangle_2 = \langle f - f_0, \psi^H_{lk}\rangle_2 - \frac{1}{\sqrt{n}}W_n(\psi^H_{lk}),$$

an application of Lemma 3 thus leads to, with $\Pi^{D_n}$ the restriction of $\Pi$ to $D_n$,

$$E^{\Pi^{D_n}}\Big[e^{t\sqrt{n}\langle f^{L_n} - \hat f_n^{L_n}, \psi^H_{lk}\rangle_2} \mid X^{(n)}\Big] \le e^{Ct^2}\,\frac{\int e^{\ell_n(f_t)-\ell_n(f_0)}\,d\Pi^{D_n}(f)}{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi^{D_n}(f)},$$

where $f_t$ is defined by $\log f_t = \log f - t\tilde\psi^H_{lk}/\sqrt{n} - c(fe^{-t\tilde\psi^H_{lk}/\sqrt{n}})$ (again, having $\psi^H_{lk}$ or $\tilde\psi^H_{lk}$ at both places in the last equality does not matter since the constant simplifies). Below to simplify the notation we denote $\gamma_n = \psi^H_{lk}$.

In the last display, once coming back to $\Pi$ via $d\Pi^{D_n}(f) = 1_{D_n}(f)\,d\Pi(f)/\Pi(D_n)$, the variable $f$ is a random dyadic histogram over the subdivision with intervals $I^{L_n}_\mu = (\mu 2^{-L_n}, (\mu+1)2^{-L_n})$ and $0 \le \mu \le 2^{L_n} - 1$. Denote by $\gamma_{n,\mu}$ the value of $\gamma_n$ over the interval $I^{L_n}_\mu$. Next observe that both $f_t$ and $f$ are, writing the histogram prior in terms of its coefficients over the subdivision, functions of $\omega = (\omega_0, \ldots, \omega_{2^{L_n}-1})$ and the integral over $f$ is nothing but an integral over $\omega \in \mathcal{S}_{L_n}$. On the other hand, from the expression of $f_t$, using the fact that $\psi^H_{lk}$ is constant over each individual interval $I^{L_n}_\mu$ (since $l < L_n$), one sees that $f_t$ is a dyadic histogram over the same subdivision with weights given by the vector $\zeta$,

$$\zeta := (\zeta_\mu)_{0\le\mu\le 2^{L_n}-1} = \Big(\frac{\omega_\mu e^{-t\gamma_{n,\mu}/\sqrt{n}}}{\sum_{\nu}\omega_\nu e^{-t\gamma_{n,\nu}/\sqrt{n}}}\Big)_{0\le\mu\le 2^{L_n}-1}. \qquad (32)$$
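The reweighting in (32) is an exponential tilt of the histogram weights followed by a renormalisation, and it is invertible on the simplex. A minimal sketch follows, assuming the tilt has the form $\omega_\mu e^{-t\gamma_{n,\mu}/\sqrt{n}}$ normalised to sum to one (as implied by $f_t \propto f e^{-t\gamma_n/\sqrt{n}}$); the function names are illustrative.

```python
import math


def tilt_weights(omega, gamma, t, n):
    """Map histogram weights omega to the weights of f_t, proportional to
    omega_mu * exp(-t * gamma_mu / sqrt(n))."""
    raw = [w * math.exp(-t * g / math.sqrt(n)) for w, g in zip(omega, gamma)]
    total = sum(raw)
    return [r / total for r in raw]


def untilt_weights(zeta, gamma, t, n):
    """Inverse map: recover omega from zeta by tilting in the opposite direction."""
    raw = [z * math.exp(t * g / math.sqrt(n)) for z, g in zip(zeta, gamma)]
    total = sum(raw)
    return [r / total for r in raw]
```

Because the map is a bijection of the simplex, integrals over $\omega$ can be rewritten as integrals over $\zeta$, at the price of a density ratio and a Jacobian factor.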

Now we are in position to change variables in the above expression by taking $\zeta$ as the new variable (this technique is developed in [5] for general, fixed influence functions). The change of variables introduces a multiplicative factor $M(\zeta)$ in front of $d\Pi(\zeta)$, a factor which is the product of the variation in density of the Dirichlet law under the change of variable and the Jacobian of the change of variable,

$$\frac{\int e^{\ell_n(f_t)-\ell_n(f_0)}\,d\Pi^{D_n}(f)}{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi^{D_n}(f)} = \frac{\int_{\tilde D_n} e^{\ell_n(f(\zeta))-\ell_n(f_0)}\,M(\zeta)\,d\mathcal{D}_\alpha(\zeta)}{\int_{D_n} e^{\ell_n(f(\omega))-\ell_n(f_0)}\,d\mathcal{D}_\alpha(\omega)},$$

where $\tilde D_n$ is the new integrating set after change of variables and the notation $f(\zeta)$ is used for $f(\zeta)(\cdot) = 2^{L_n}\sum_{\mu=0}^{2^{L_n}-1}\zeta_\mu 1_{I^{L_n}_\mu}(\cdot)$ and similarly for $f(\omega)$. Computation of the Jacobian, say $\Delta(\zeta)$, is done in Lemma 11. Calculating $M(\zeta) = \frac{d\mathcal{D}_\alpha(\omega)}{d\mathcal{D}_\alpha(\zeta)}\Delta(\zeta)$ gives that the multiplicative term $M(\zeta)$ satisfies



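$$M(\zeta)\,e^{-\frac{t}{\sqrt{n}}\sum_{\mu}\alpha_\mu\gamma_{n,\mu}} = \Big[\int_0^1 e^{\frac{t}{\sqrt{n}}\gamma_n(x)}\,f(\zeta)(x)\,dx\Big]^{-\sum_{\mu}\alpha_\mu}.$$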



For the term on the right hand side, since $\alpha > 1/2$ it holds $|t|\,\|\gamma_n\|_\infty/\sqrt{n} = o(1)$ so one can expand the exponential function by writing $e^u = 1 + O(u)$ as $u = o(1)$. Next, write $f = f_0 + (f - f_0)$, so that the expression under brackets writes $1 + O((t/\sqrt{n})[\langle f_0,\gamma_n\rangle_2 + \langle f-f_0,\gamma_n\rangle_2])$. The last term is a $O(|t|/\sqrt{n})$, since the Haar-coefficients of $f_0$ are certainly bounded and those of $f - f_0$ are bounded above by a constant times $\|\gamma_n\|_\infty h(f,f_0)$ which is bounded because $2^{L_n/2}\varepsilon_n = O(1)$. So if $\sum_\mu \alpha_\mu/\sqrt{n} = O(1)$, the term at stake is $O(1)$. But this condition follows from (15), because the number of terms in the previous sum is $2^{L_n} = O(\sqrt{n})$ when $\alpha > 1/2$.

Now we deal with the exponential term on the left hand-side of the last display. A term in the sum $\sum_\mu \alpha_\mu\gamma_{n,\mu}$ is nonzero only if the support of the Haar basis element $\gamma_n = \psi^H_{lk}$ (or $\varphi^H$) intersects $I^{L_n}_\mu$. This is the case for $2^{L_n-l}$ terms for $\psi^H_{lk}$ and $2^{L_n}$ terms for $\varphi^H$. Using $\|\psi^H_{lk}\|_\infty \le 2^{l/2}$, this shows that the considered sum is at most a constant times $2^{L_n - l/2} \le 2^{L_n}$. In particular, for $\alpha > 1/2$ the considered term is a $O(1)$.

The previous reasoning shows that the change of variable part generates a multiplicative factor $O(1)$, times the term $e^{Ct^2}$ coming from the application of Lemma 3. To conclude it now suffices to apply the maximal inequality technique in the same way as for log-density priors. □

4.5 Tools for density estimation 

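The notation here follows the one introduced in Section 4.2, in particular $\|\cdot\|_L$, $W_n$, $R_n$ and $B$.

Lemma 3. Let $f_0$ belong to $\mathcal{F}$. Let $\{a_n\}$ be a sequence of reals such that $na_n^2 \ge 1$ for any $n \ge 1$. Let $\{\Pi_n\}$ be a collection of priors on densities restricted to the set $\{f,\ h(f,f_0) \le a_n\}$. Let $\{\gamma_n\}$ be an arbitrary sequence in $L^\infty[0,1]$. Set $\tilde\gamma_n := \gamma_n - P_{f_0}\gamma_n$. Suppose, for some $m > 0$ and all $n \ge 1$,

$$\|\tilde\gamma_n\|_L \le m, \qquad \|\tilde\gamma_n\|_\infty \le (4a_n\log(n+1))^{-1}.$$

Then there exists $C > 0$ depending on $m$, $\|f_0\|_\infty$ only such that for any $n \ge 1$ and any $|t| \le \log n$,

$$E^{\Pi_n}\Big[e^{t\sqrt{n}\langle f-f_0,\gamma_n\rangle_2} \mid X^{(n)}\Big] \le e^{Ct^2 + tW_n(\tilde\gamma_n)}\,\frac{\int e^{\ell_n(f_t)-\ell_n(f_0)}\,d\Pi_n(f)}{\int e^{\ell_n(f)-\ell_n(f_0)}\,d\Pi_n(f)},$$

where $f_t$ is defined by $\log f_t = \log f - \frac{t}{\sqrt{n}}\tilde\gamma_n - c(fe^{-t\tilde\gamma_n/\sqrt{n}})$.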

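Proof of Lemma 3. Denote $g = \log f$, $g_0 = \log f_0$. From elementary algebra it follows that

$$t\sqrt{n}\langle f-f_0,\gamma_n\rangle_2 + \ell_n(f) - \ell_n(f_0)$$

$$= -\frac{n}{2}\Big\|g - g_0 - \frac{t}{\sqrt{n}}\tilde\gamma_n\Big\|_L^2 + \sqrt{n}\,W_n\Big(g - g_0 - \frac{t}{\sqrt{n}}\tilde\gamma_n\Big) + t\sqrt{n}\,B(\tilde\gamma_n,f,f_0) + R_n(f,f_0) + \frac{t^2}{2}\|\tilde\gamma_n\|_L^2 + tW_n(\tilde\gamma_n)$$

$$= \ell_n(f_t) - \ell_n(f_0) + \big[t\sqrt{n}\,B(\tilde\gamma_n,f,f_0) + R_n(f,f_0) - R_n(f_t,f_0)\big] + \frac{t^2}{2}\|\tilde\gamma_n\|_L^2 + tW_n(\tilde\gamma_n).$$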

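Let us show that the term under brackets is small. From the definition of $R_n$, $f_t$,

$$R_n(f,f_0) - R_n(f_t,f_0) = -\frac{t^2}{2}\|\tilde\gamma_n\|_L^2 + t\sqrt{n}\,\langle g-g_0,\tilde\gamma_n\rangle_L + n\log F\big[e^{-\frac{t}{\sqrt{n}}\tilde\gamma_n}\big].$$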

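Next expand the last logarithmic term. Using the assumption on $\|\tilde\gamma_n\|_\infty$ and $|t| \le \log n$, the absolute value of the exponent in this term is at most 1/4. This enables us to expand successively the logarithm and exponential functions, using the inequalities (the first is valid for $|x| \le 1/4$),

$$e^{-x} \le 1 - x + x^2, \qquad \log(1+x) \le x,$$

leading to

$$\log F\big[e^{-\frac{t}{\sqrt{n}}\tilde\gamma_n}\big] \le \log F\Big[1 - \frac{t}{\sqrt{n}}\tilde\gamma_n + \frac{t^2}{n}\tilde\gamma_n^2\Big] = \log\Big(1 - \frac{t}{\sqrt{n}}F\tilde\gamma_n + \frac{t^2}{n}F\tilde\gamma_n^2\Big)$$

$$\le -\frac{t}{\sqrt{n}}F\tilde\gamma_n + \frac{t^2}{n}F_0\tilde\gamma_n^2 + \frac{t^2}{n}(F-F_0)\tilde\gamma_n^2 .$$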
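The two elementary inequalities used in this expansion can be checked on a grid. The sketch below is only a numerical sanity check over $|x| \le 1/4$, the range stated for the first inequality; the function name and tolerance are illustrative.

```python
import math


def check_expansions(num=2001, tol=1e-15):
    """Check e^{-x} <= 1 - x + x^2 and log(1 + x) <= x on a grid of [-1/4, 1/4]."""
    xs = [-0.25 + 0.5 * i / (num - 1) for i in range(num)]
    exp_ok = all(math.exp(-x) <= 1.0 - x + x * x + tol for x in xs)
    log_ok = all(math.log1p(x) <= x + tol for x in xs)
    return exp_ok and log_ok
```

Both inequalities are tight at $x = 0$, which is why the remainder in the expansion is of second order in $t/\sqrt{n}$.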

The last term, using Lemma 6 and $h(f,f_0)\|\tilde\gamma_n\|_\infty \le 1$ together with $\|\tilde\gamma_n\|_2 \lesssim \|\tilde\gamma_n\|_L \le m$, is a $O(t^2/n)$. On the other hand,

$$F\tilde\gamma_n = (F - F_0)\tilde\gamma_n = \Big\langle \frac{f-f_0}{f_0}, \tilde\gamma_n\Big\rangle_L = \langle g-g_0, \tilde\gamma_n\rangle_L + B(\tilde\gamma_n,f,f_0).$$

Combine the previous results to obtain the desired bound. □

Lemma 4. Consider the log-density prior (11). Suppose $g_0 = \log f_0$ belongs to $C^\alpha$, with $\alpha > 1/2$. Suppose (14) holds and let $\varepsilon_n$, $\zeta_n$ be defined as below (14). Then for $M$ large enough,

$$E^n_{f_0}\Pi\big[f : \|f-f_0\|_2 \le M\varepsilon_n,\ \|f-f_0\|_\infty \le M\zeta_n \mid X^{(n)}\big] \to 1.$$

Proof. Obtaining this result could be done as a consequence of the techniques presented in [10]. Alternatively, as a first step we follow an idea presented in [14] (originally the authors use it to derive a bound on the second moment of $\log(f/f_0)$). Next, the $L^2$-rate $\varepsilon_n$ and sup-norm rate $\zeta_n$ are obtained as a fairly direct consequence (this argument appears to be new).
The starting point is the following classical inequality, see e.g. Lemma 8 in [8],

$$\|\log(f/f_0)\|_L^2 \lesssim h^2(f,f_0)\,(1 + \log\|f/f_0\|_\infty).$$

The last term is itself bounded by a constant times $\varepsilon_n^2(1 + \|T - c(T) - g_0\|_\infty)$ by assumption, where $g_0 = \log f_0$. Next one bounds this last quantity from its expression

$$\|T - c(T) - g_0\|_\infty = \Big\|\sum_{l,k}\langle T - c(T) - g_0, \psi_{lk}\rangle_2\,\psi_{lk}\Big\|_\infty$$

$$\lesssim \sum_{l\le L_n} 2^{l/2}\max_k |\langle T - c(T) - g_0, \psi_{lk}\rangle_2| + \sum_{l>L_n} 2^{l/2}\max_k |\langle g_0, \psi_{lk}\rangle_2| .$$

Since $g_0$ is Hölder $\alpha$, the last term is bounded by a constant times $\varepsilon^*_{n,\alpha} = o(1)$. For the middle term, the Cauchy-Schwarz inequality yields the bound $2^{L_n/2}\|T - c(T) - g_0\|_2$, using $\sum_{l\le L_n} 2^l \lesssim 2^{L_n}$ and bounding the maximum of squares by its sum. Deduce that

$$\|\log(f/f_0)\|_L^2 \lesssim \varepsilon_n^2 + \varepsilon_n^2\,2^{L_n/2}\,\|\log(f/f_0)\|_2 \lesssim \varepsilon_n^2 + \varepsilon_n^2\,2^{L_n/2}\,\|\log(f/f_0)\|_L .$$

Since $\varepsilon_n 2^{L_n/2} = o(1)$ ($\alpha > 1/2$), one obtains $\|\log(f/f_0)\|_2^2 \lesssim \varepsilon_n^2$ on the event $\{f,\ h(f,f_0) \le \varepsilon_n\}$. The squared $L^2$-norm of $f - f_0$ can be expressed as

$$\|f-f_0\|_2^2 = \int_0^1 f_0^2\,\big(e^{T - c(T) - g_0} - 1\big)^2 .$$

Along the way, under the event $\{f,\ h(f,f_0) \le \varepsilon_n\}$, we have obtained the bound

$$\|T - c(T) - g_0\|_\infty \lesssim 2^{L_n/2}\|T - c(T) - g_0\|_2 \lesssim 2^{L_n/2}\varepsilon_n = \zeta_n .$$

The inequality $|e^x - 1| \le C|x|$, valid for $C$ large enough as soon as $x$ takes values in a bounded set of $\mathbb{R}$, implies, again in case $f$ is in an $\varepsilon_n$-Hellinger neighborhood of $f_0$,

$$\|f-f_0\|_2^2 \le C^2\int_0^1 f_0^2\,(T - c(T) - g_0)^2 \lesssim \|T - c(T) - g_0\|_2^2 \lesssim \varepsilon_n^2 .$$

Similarly, since $f_0$ is bounded, we get $\|f-f_0\|_\infty \le C\zeta_n$ on $\{f,\ h(f,f_0) \le \varepsilon_n\}$. □

4.6 Other lemmas 

Given $R > 0$, let $B^{\alpha}_{\infty,\infty}(R)$ denote the centered ball of $B^{\alpha}_{\infty,\infty}[0,1]$ of radius $R$ for the norm $\|\cdot\|_{\infty,\infty,\alpha}$ given in Section 2.2.

Lemma 5. There exists a constant $C > 0$ such that for any $f$, $g$ in $B^{\alpha}_{\infty,\infty}(R_f)$ and $B^{\alpha}_{\infty,\infty}(R_g)$ respectively, the product $fg$ belongs to $B^{\alpha}_{\infty,\infty}(CR_fR_g)$. If $f$ belongs to $C^\alpha[0,1]$ and is bounded away from 0 and infinity by universal constants then $f^{-1}$ belongs to $C^\alpha[0,1]$. Moreover, the $\|\cdot\|_{\infty,\infty,\alpha}$-norm of $\psi_{lk}$ is

$$\|\psi_{lk}\|_{\infty,\infty,\alpha} = 2^{l(\frac12+\alpha)}.$$

Proof. The first claim follows from the main result of Section 2.8.3 in [21] (strictly speaking the last result is for functions on the whole $\mathbb{R}$, whereas we consider functions on $[0,1]$, but the latter functions can be shown to be restrictions to $[0,1]$ of elements of $B^{\alpha}_{\infty,\infty}(\mathbb{R})$ whose norm is equivalent to the one of the restriction, see [13] Proposition 2 for a similar argument in the case of Sobolev spaces). The second claim is a simple computation using the definition of Hölder spaces. For the last claim one uses the characterisation of $B^{\alpha}_{\infty,\infty}$ in terms of wavelet coefficients from Section 2.2, which yields $\|\psi_{lk}\|_{\infty,\infty,\alpha} = \max_{l',k'} 2^{l'(\frac12+\alpha)}|\langle \psi_{lk},\psi_{l'k'}\rangle_2| = 2^{l(\frac12+\alpha)}$. □

Lemma 6. Let $f$, $f_0$ be two densities such that $f_0$ is bounded away from infinity. Let $g$ be an element of $L^\infty$ such that $h(f,f_0)\|g\|_\infty \le C_1$ and $\|g\|_2 \le C_2$, for some constants $C_1, C_2 > 0$. Then

$$|(F - F_0)g^2| \le C_1^2 + C_1\sqrt{4C_2^2\|f_0\|_\infty + C_1^2}.$$

Proof. Denote $E := \int_0^1 |f - f_0|\,g^2$. Then by the Cauchy-Schwarz inequality,

$$E^2 = \Big(\int_0^1 |\sqrt{f} - \sqrt{f_0}|\,(\sqrt{f} + \sqrt{f_0})\,g^2\Big)^2$$

$$\le 2h(f,f_0)^2\int_0^1 (f + f_0)\,g^4$$

$$\le 2h(f,f_0)^2\big[\|g\|_\infty^2 E + 2\|f_0\|_\infty\|g\|_\infty^2\|g\|_2^2\big]$$

$$\le 2C_1^2\big[E + 2C_2^2\|f_0\|_\infty\big].$$

This implies that $E$ is less than the largest root of the polynomial $X^2 - 2C_1^2X - 4C_1^2C_2^2\|f_0\|_\infty$, which can be expressed in terms of $C_1$, $C_2$, $\|f_0\|_\infty$. □
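The last step of the proof of Lemma 6 solves a quadratic inequality. As a hedged illustration, the sketch below assumes the inequality $E^2 \le 2C_1^2[E + 2C_2^2\|f_0\|_\infty]$ obtained in the proof and computes the largest root of the associated polynomial; the function names and the numerical values in the usage are illustrative.

```python
import math


def quadratic(x, c1, c2, f0_sup):
    """Polynomial X^2 - 2 c1^2 X - 4 c1^2 c2^2 ||f0||_inf evaluated at x."""
    return x * x - 2.0 * c1 ** 2 * x - 4.0 * c1 ** 2 * c2 ** 2 * f0_sup


def largest_root(c1, c2, f0_sup):
    """Largest root c1^2 + c1 * sqrt(c1^2 + 4 c2^2 ||f0||_inf) of the polynomial."""
    return c1 ** 2 + c1 * math.sqrt(c1 ** 2 + 4.0 * c2 ** 2 * f0_sup)
```

Any nonnegative `E` with `quadratic(E, ...) <= 0` lies below `largest_root(...)`, which gives the bound of the lemma.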

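Lemma 7. Let $f_0 \in \mathcal{F}$ and $g_0 = \log f_0$, and let $\mathcal{T}^{L_n}$ be defined by (19), with $A_{l,k}$ any elements of $L^\infty[0,1]$ such that there exist constants $c_1$, $c_2$ with, for any $l$, $k$ with $k < 2^l \le 2^{L_n}$, any $n \ge 2$,

$$\|A_{l,k}\|_\infty \le c_1\sqrt{n/\log n}, \qquad \|A_{l,k}\|_2 \le c_2 .$$

Then for any $n \ge 2$ and $L_n$ defined by (3), it holds

$$E^n_{f_0}\|\mathcal{T}^{L_n} - g_0^{L_n}\|_\infty \lesssim \varepsilon^*_{n,\alpha}.$$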
Proof. We proceed exactly as for the proof of the maximal inequality in Section 2.1. For any t > 0, 



In ^ t 



We have $W_n(A_{l,k}) = \mathbb{G}_n(A_{l,k})$ and bounds on exponential moments of the last empirical quantity are well-known. From Laplace transform controls one gets, for any real $s$,

E n [ e «W„(A.*)] < e 4[/AV°] eMM! ' felloo/V5 \ 

Let us choose $t = \sqrt{l} \le \sqrt{L_n}$. Under the conditions of the lemma the last display with $s = t$ or $s = -t$ is bounded above by $e^{Ct^2}$. This leads to the bound $E^n_{f_0}\|\mathcal{T}^{L_n} - g_0^{L_n}\|_\infty \lesssim \varepsilon^*_{n,\alpha}$. □

Lemma 8. Let $\varphi = \varphi_G$ and $\sigma_l = 2^{-l(\frac12+\gamma)}$ for all $l \le L_n$, and any given $0 < \gamma \le \alpha - 1/4$. Then (13)-(14) hold. The same applies for $\varphi = \varphi_{H,\tau}$ and $\sigma_l = 2^{-l\alpha}$ for all $l \le L_n$, for any given value of the parameter $0 \le \tau < 1$.

Proof. The first result is a minor adaptation of Theorem 4.5 in [23], where the authors consider a cut-off at $n^{1/(2\alpha+1)}$ instead of $2^{L_n} = h_n^{-1} = (n/\log n)^{1/(2\alpha+1)}$ (equality up to fixed multiplicative constants). Taking the cut-off at $2^{L_n}$ only changes logarithmic factors in their argument. In the heavy-tail case, one adapts Theorem 2.1 in [14]. There are two points to note. First, taking $2^{L_n}$ instead of $n^{1/(2\alpha+1)}$ induces only, again, an extra logarithmic power in the rate. Second, strictly speaking the authors in [15] consider wavelets on the interval via periodisation, which imposes conditions at the boundary (periodic Besov spaces), conditions which can be dropped when using the CDV wavelet basis. Explicit (re)derivation of the previous two results is omitted. □

Lemma 9. Let $\varphi = \varphi_G$ and $\sigma_l$ satisfy (13). Then the prior $\Pi$ defined by (11), with $L_n$ as in (3) and $\varepsilon_n$ as below (14), satisfies, for $C > 0$ large enough,

$$E^n_{f_0}\Pi\big[\|T\|_\infty \le C\sqrt{n}\,\varepsilon_n \mid X^{(n)}\big] \to 1.$$

Proof. For any given $c > 0$, Borell's inequality implies that $\Pi[\|T\|_\infty > C\sqrt{n}\,\varepsilon_n] \le e^{-cn\varepsilon_n^2}$ for large enough $C$: apply Corollary 5.1 in [24], where $\sigma(T)$ is bounded using $\sigma_l \le 2^{-l(\frac12+\gamma)}$. Next, one applies Lemma 1 in [9]. To do so, one needs to bound from below the prior probability of a Kullback-Leibler neighborhood of $f_0$ of size $\varepsilon_n$ by $e^{-dn\varepsilon_n^2}$ for some $d > 0$. This follows from the conclusion of Theorem 5 in [23], which (modulo the fact that our $\varepsilon_n$ is within a logarithmic factor of theirs, as noted in Lemma 8 to accommodate our slightly different choice of cut-off $2^{L_n}$), provides the bound $\Pi[\|g-g_0\|_\infty \le 4\varepsilon_n] \ge e^{-n\varepsilon_n^2}$. Switching from the sup-norm on $g - g_0$ to the Kullback-Leibler divergence between $g$ and $g_0$ follows from Lemma 3.1 in [23]. □

Lemma 10. Suppose $f_0$ belongs to $C^\alpha[0,1]$, for some $\alpha > 1/2$. Let $\Pi$ be a prior on histogram densities defined by (15). Then, for $M$ large enough,

$$E^n_{f_0}\Pi\big[f,\ h(f,f_0) \le M(\log n/n)^{\frac{\alpha}{2\alpha+1}} \mid X^{(n)}\big] \to 1.$$

Proof. The rate follows from an application of the general rate theorem in [7]. To check the prior mass condition, one uses their Lemma 6.1 on Dirichlet weights. If the weight sequence $\omega_{L_n} = (\omega_0, \ldots, \omega_{2^{L_n}-1})$ has distribution $\mathcal{D}(\alpha_0, \ldots, \alpha_{2^{L_n}-1})$, and if $(x_0, \ldots, x_{2^{L_n}-1})$ is any point in the unit simplex $\mathcal{S}_{L_n}$, then under (15), it holds, for some $C > 0$ and any $\varepsilon > 0$ such that $\varepsilon^2 \le 2^{-L_n}$,

$$\Pr\Big[\sum_{i=0}^{2^{L_n}-1} |\omega_i - x_i| \le \varepsilon^2\Big] \ge e^{-C2^{L_n}\log(1/\varepsilon)}.$$



Set $\varepsilon_n = (n/\log n)^{-\frac{\alpha}{2\alpha+1}}$, then with $L_n$ as in (15) one gets that the previous display is bounded below by $e^{-Cn\varepsilon_n^2}$. The Kullback-Leibler neighborhood of Condition (2.4) in [7] now easily relates to the $L^1$-type neighborhood in the last display. The entropy and sieve conditions of the general rate theorem are easily verified for this choice of $\varepsilon_n$, which concludes the proof. □

Lemma 11. Let $\Delta(\zeta)$ be the Jacobian of the change of variables $\omega \to \zeta$ given by (32) over the unit simplex $\mathcal{S}_{L_n}$. It holds, with $f(\zeta) = 2^{L_n}\sum_{\mu=0}^{2^{L_n}-1}\zeta_\mu 1_{I^{L_n}_\mu}$,

A(0 = .Jo fZe^*V^f{Q(x)dx' 
Proof. This follows from elementary calculations, see [5]. □